PREFACE


The present century has seen extensive developments in
methods of measuring scientifically the traits and abilities
of human beings. Three main influences may be discovered
behind this movement. First, there is the tremendous
interest which people have always shown in judging one
another's qualities. Much of their conversation consists in
summing up their acquaintances. Their plans, either for
work or play, are very largely guided by their notions
regarding the persons with whom they are about to deal.
It is obvious that such notions are frequently biased or
inaccurate, and yet each one continues to rely, in great
affairs or small, chiefly on his own insight into the characters
of others. However, it is recognized that in certain
spheres of human activity, more accurate and impartial
assessments are essential. The examination system, now
firmly established in schools, universities, and in many
professions, is supposed to measure the achievements and
capacities of pupils, of students, and of candidates for
various jobs. Although this system is certainly an advance
on the earlier system of personal judgments, yet much
doubt exists as to its efficiency. A considerable section of
this book is therefore devoted to a discussion of the defects
of examinations, and of present-day methods of marking
pupils' or students' work, and to a study of how these
may be improved.

Secondly, mankind is interested, not only in individuals,
but also in general problems of human nature and human
institutions. Thus another field where accurate measures
of human qualities are needed is the scientific investigation
of mind, behaviour, and society. Loose thinking and
prejudice are extremely prevalent in matters of politics,
economics, religion, crime, and the like, as is shown, for
example, by Thouless in his book Straight and Crooked
Thinking (1930). Yet the social sciences, such as psychology
and sociology, are endeavouring to study these subjects
rationally and impartially, in the same manner as the
physicist studies the atom, or the physiologist the workings
of the body. Now the data from which a science is built
up are very largely numerical and quantitative. The
records of a chemical or physical experiment usually consist
of measures of weights, times, distances, galvanometer
or thermometer readings. Measurement is almost equally
important for purposes of precise description and experiment
in the social sciences; and the deduction of sound
conclusions from such quantitative data frequently involves
the application of statistical treatment. In this
connection, then, we shall consider below the nature of 
and the elements of statistical 
тепсе to the field of education. 
nial testing movement lies in 
erment. Medical science has

This book is thus intended primarily for three classes of
students, namely teachers and examiners, psychologists
and others who set out to investigate human phenomena
from the scientific standpoint, and doctors and psychologists
who wish to make practical use of tests in clinics and
schools. It makes no attempt to describe the various
mental tests (though a classified list of them is included in
Chap. IX), since many good accounts of the main ones are
already available, e.g. by Knight (1933), Blackburn (1939),
Oakley and Macrae (1937). Most authors, however, seem
to have emphasized the value of tests without sufficiently
drawing attention to their dangers and difficulties. Thus
the main object of this book is to outline the principles of
test construction and application or procedure, and the
interpretation of the results of tests and examinations,
ignorance of which on the part of testers has frequently led
to serious errors. Mental measurement undoubtedly needs
a high degree of skill, which few of the persons just mentioned
have either the time or the opportunity to acquire
thoroughly, before they set out to test or examine.

There are also available many excellent textbooks on
statistical methods, perhaps the most useful to testers
being those of Garrett (1937), Dawson (1933), and Guilford
(1936). Yet in the present writer’s experience, all of these
are too difficult for the majority of students whose psychological
training only amounts to between seventy-five and
one hundred and fifty hours. This book aims, therefore,
to describe the minimum essentials, to show their immediate
application to psychological (and especially to educational)
problems, and to stress their general significance
rather than their technical details.

More advanced students are unlikely to find within much
that is original. But, so far as the writer knows, no
previous account of the statistics of marking, nor of the
construction of new-type examinations, has been published
by a British author. The analysis of educational measurement
and the discussion of the technique of examining are
partly new but owe a great deal to Hamilton’s brilliant
book, The Art of Interrogation (1929). Possibly also some
of the practical hints about tests and testing have not
been published before.

Lack of space has necessitated some important omissions,
such as the whole field of temperament and character
assessment, the testing of special (vocational) aptitudes,
and of sensory-motor capacities. Actually few of the tests
falling under these headings are well standardized, and
fewer still can safely be applied by those who are not
trained psychologists. The questionnaire, the interview,
observational techniques, and the application of scientific
methods to educational and social-psychological experiments,
are other topics which may, it is hoped, be dealt
with in a later volume.


I wish to express my thanks to Mr. B. Babington Smith
and to Mr. P. Kemp for reading parts of the manuscript,
and for giving valuable suggestions and criticisms. I am
also especially indebted to my wife for her advice on
points connected with teaching, and for her help in the
preparation of the manuscript.

P. E. V.


The University,
Glasgow,


CHAPTER I


INTRODUCTION TO STATISTICAL METHODS 


It is an unfortunate, yet indubitable, fact that the majority
of students of psychology and education find extreme
difficulty in grasping and applying the statistical concepts
which are an essential prerequisite to sound mental
measurement. They take fright at the mere mention of
statistics or at the sight of statistical symbols and formulae.
A small minority, on the other hand, find in the manipulations
of numbers which we call statistics an intense
fascination, closely akin to the musical composer’s fascination
with interweavings of sounds. Probably statistical
ability is as specialized and as rare as is ability in musical
composition, and the expert statistician rather easily loses
touch with the doubts and difficulties of the beginner.
He fails to realize that the statistical way of looking at
educational or psychological phenomena, which comes so
naturally to him, is at first entirely foreign to the average
student. Yet the student's ‘phobia’ is certainly unnecessary,
since the elementary statistics needed for the
proper treatment of marks and test scores includes little
more than the kind of algebra and arithmetic which he
mastered by the age of 14. Statistics means nothing
more than numerical data, such as measurements,
examination and test results; and statistical methods are
merely the ways of dealing with, or organizing, these data
so as to bring out their full significance.

The raison d’être of statistical methods is that human
beings are apt to differ so widely from one another in any
and every measurable characteristic, that results or conclusions
which apply to one person or one group of persons
(e.g. one school class), do not necessarily apply to another
person or group. First, then, we must have efficient
techniques of arranging or tabulating the scores and marks
obtained from different people, in order to see what is the
general run of their scores, and what is the range or extent
of the differences. These techniques are described in the
present chapter. The ‘general run’ and the ‘range or
extent’ of scores next require more precise definition and
measurement; that is to say, we must know how to work
out averages and indices of what is called spread, scatter,
dispersion, or variability (Chap. III).

Not only do human beings differ from one another, but
also each individual varies in his capacities and characteristics
from time to time. The score he achieves to-day may
or may not be typical of his usual accomplishments. The
same holds true of a class of pupils or students. The extent
of this untrustworthiness or unreliability (as it is generally
termed) of a single measurement or of a set of measurements
therefore needs to be determined. Statistical
treatment will tell us whether any alteration or variation
in scores represents a real and important change—such a
change, for example, as we might hope to bring about by
means of the teaching given to the individual or to the class
—or whether it is merely a matter of chance fluctuations.
A further very common and important problem is the
reality or significance of a difference between two groups
of people. For instance, girls may appear to be better than
boys at reading; but this is not true of all girls and all
boys; so we must have a method of establishing how
general and how reliable is the difference (Chap. V).

Still another type of variation requires special techniques
of investigation, namely, the variations in people's capacities
in different spheres. It is obvious that no one is even
all-round in his level of abilities. Yet we are apt to assume
that high achievement in one field implies high achievement
in other fields. Sometimes also we infer the opposite
type of relationship—that if a person is strong in some
trait, then he will be lacking in some other trait. The
statistical methods of correlation enable us to say how far
various traits or abilities ‘go together’ or are related to
one another. They also tell us how far our tests or examinations
measure what they are supposed to measure,
i.e. how valid they are. For example, a Certificate or
Matriculation examination at the age of 16-18 may, or
may not, validly predict success in a subsequent university
career (Chaps. VI-VIII).

Although statistical methods are, then, of undoubted 
importance, their limitations should also be recognized at 
this stage in our discussion. They are tools for handling 
numerical data, but they are not capable of improving 
data which may be initially inaccurate or biased. When 
an investigator uses a poor test or examination, or when he 
observes and records people's behaviour or mental pro- 
cesses wrongly, no amount of statistical treatment of his 
results can correct such deficiencies. Further, the fundamentally
abstract nature of statistical methods needs to be
stressed. They are adapted primarily for mass-investigations
of measurements obtained from large numbers of
persons, and such investigations are apt to lose touch with 
the concrete individual case. Without them we could not 
make sound generalizations about human beings, but even 
with them we cannot usually state whether a generalization 
applies to a particular human being. The practical handling
and understanding of an actual child or adult is still
an art rather than a science, though it may be greatly 
assisted by scientific mental tests and guided by the results of 
objective psychological experiments and statistical studies. 


TABULATION OF MEASURES 


Measures and Variables.—By a variable we shall mean 
an ability or some other characteristic in respect of which 
human beings vary or differ from one another. Height, 
age, income, intelligence, examination success, are all vari- 
ables in this sense. A measure is the numerical expression 
of a certain standing on, or amount of, a variable. Fifty 
inches of height, five pounds a week income, and a mark of 
6 out of 10 are examples of measures.

Most variables are continuous; that is to say, they can
vary by infinitely small gradations. But we only measure
them to the nearest convenient unit. For example,
heights may be measured to the nearest tenth of an inch.
But actually 50.1 inches includes all heights from 50.05 to
50.15; 50.2 inches includes from 50.15 to 50.25, and so
on. Similarly, a school mark of 6 is usually an approximation
representing all degrees of quality from 5½ to 6½.
If, however, half-marks are awarded, then 6 represents 5¾
to 6¼, and 6½ represents 6¼ to 6¾, and so on. But some
variables are discrete or discontinuous, since they only vary
by units. For instance, if the size of a class of pupils is
40, this does not imply between 39½ and 40½ pupils. The
scores for many tests, and some school marks, are of this
type, for they represent the exact numbers of questions or
test items correctly answered. Thus a pupil would score
6 out of 10 for arithmetic sums even if he had nearly
completed a seventh sum. (Such measures are referred
to in the next chapter as count or enumeration scores.)
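
The rule for the true limits of a continuous measure (half a unit
on either side of the recorded value) may be illustrated by a short
sketch in Python. The function name true_limits and the use of
exact fractions are assumptions made here for illustration; they
are not part of the original text.

```python
from fractions import Fraction


def true_limits(measure, unit=1):
    """Return the (lower, upper) true limits of a continuous measure
    that has been recorded to the nearest `unit`.

    Values are passed as integers or decimal strings so that the
    arithmetic stays exact.
    """
    half_unit = Fraction(unit) / 2
    value = Fraction(measure)
    return value - half_unit, value + half_unit


# Height recorded to the nearest tenth of an inch.
low, high = true_limits("50.1", "0.1")
print(float(low), float(high))

# A school mark of 6 awarded in whole marks, then in half-marks.
print(true_limits(6))
print(true_limits(6, "0.5"))
```

A discrete measure, such as a count of sums correctly answered, has
no such limits: a score of 6 means exactly 6.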

This distinction between continuous and discrete vari- 
ables needs to be borne in mind when tabulating measures 
(cf. p. 6). 

Tabulation.—Tabulation means arranging a set of
measures in an orderly fashion, so as to convey the essential
facts about them in convenient and relatively brief form.
If we take a class of pupils, and beside each pupil’s name
write his age, or exam. mark, we get a list, but it is not a
table, since the measures are quite haphazardly arranged.
It is very difficult to see from such a list, if it be a long
one, what is the general run and range of measures, or
whether any particular measure (e.g. John Smith’s 6 out
of 10) is high, low, or average.

Ranking.—One way of tabulating is to rearrange all the
measures in order of size from highest to lowest; that is, to
put them in rank order. The desired information can then
be grasped much more readily. But this is a laborious
process if the number of pupils is large.

Frequency Distributions.—Under such circumstances it
is better to write in a column each possible measure, in
order, and beside it the frequency, F; that is, the number
of persons who obtain each measure. The result is called
a frequency distribution, since it shows how the various
measures are allotted or distributed among the persons.


Ex. 1.—The following is a list of the marks (out of 20) which
were awarded to a class of 50 pupils. The table below shows
their frequency distribution. Marks: 10, 7, 5, 17, 15, 18, 4, 6, 
9, 15, 16, 20, 17, 19, 14, 9, 11, 17, 18, 8, 1, 9, 12, 19, 16, 15, 17, 
14, 13, 10, 6, 20, 15, 18, 17, 13, 11, 4, 16, 13, 8, 14, 18, 15, 12, 
17, 16, 15, 17, 2. 


Measures   Tallies      F
   20      //           2
   19      //           2
   18      ///          3
   17      ///// //     7
   16      ////         4
   15      ///// /      6
   14      ///          3
   13      ////         4
   12      //           2
   11      //           2
   10      //           2
    9      ///          3
    8      //           2
    7      /            1
    6      //           2
    5      /            1
    4      //           2
    3                   0
    2      /            1
    1      /            1

Hints on Drawing Up a Frequency Table.—Do not look
through the original list and pick out all who got the
highest measure, e.g. 20, then all who got the next highest,
19, and so on. This would be a waste of time and would
easily lead to mistakes. Instead, go through the measures
in the list consecutively, and as each one is reached, put a
tally or check mark opposite the appropriate measure in
the column. When any one measure recurs for the fifth
time, draw a line through the first four tallies. Repeat
this on reaching the tenth, fifteenth, etc., tally. Add up
the tallies to obtain the F column. Add up the F column
itself to make sure that it checks with N, the total number
of persons.
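
The tally procedure just described can be sketched in Python as
follows. This is a minimal illustration only: the function name
frequency_table and the short list of marks are hypothetical, and
the marks are not those of Ex. 1.

```python
from collections import Counter


def frequency_table(measures, lowest=None, highest=None):
    """Return [(measure, F), ...] from highest to lowest measure.

    Every whole-number measure between `lowest` and `highest` gets a
    row, including those with a frequency of zero. The optional
    limits are assumed to enclose all the measures.
    """
    tallies = Counter()
    for m in measures:          # one consecutive pass through the list
        tallies[m] += 1         # one tally per measure reached
    lowest = min(measures) if lowest is None else lowest
    highest = max(measures) if highest is None else highest
    table = [(m, tallies[m]) for m in range(highest, lowest - 1, -1)]
    # The F column must add up to N, the total number of persons.
    assert sum(f for _, f in table) == len(measures)
    return table


marks = [7, 5, 9, 7, 4, 9, 7, 6, 9, 7, 2, 6]   # hypothetical marks out of 10
for measure, f in frequency_table(marks):
    print(f"{measure:>2}  {'/' * f:<6} {f}")
```

The single pass through the list mirrors the advice above; the final
check corresponds to adding up the F column and comparing it with N.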

Distributions of Classified Measures.—The above procedure
is still likely to produce an over-lengthy table,
especially when N is large. It would be absurd, for example,
to tabulate the heights of all the pupils of a school
in this manner. Instead of listing the frequencies of children
40.0, 40.1, 40.2 . . . inches tall, we would give the
totals with heights of 40, 41, 42 . . . inches, or the totals
with heights of 40 upwards, 42 upwards, 44 upwards, and
so on. That is, we would group together sets of contiguous
measures under a series of classes or categories.

Considerable care is needed in deciding upon what is, or
is not, included in a class. When the variable is discrete—
for example, test scores ranging from 10 to 98—we may
call our classes 10-14, 15-19, 20-24, etc., or else 10 +,
15 +, 20 +, etc. Both of these indicate clearly that all
scores of 10, 11, 12, 13, and 14 are included in the first class,
and that the size of each class is five units. The same
methods may be used for school marks. For instance,
10%-14%, 15%-19% . . . is unambiguous, although,
strictly speaking, the first class includes all shades of merit
from 9½% to 14½%. But with some continuous variables
such as age or height, difficulties often arise because the
original measures are themselves only approximations
which represent classes of measures (cf. p. 4). Thus if
height has been measured to the nearest tenth of an inch,
and is tabulated to the nearest inch, we should avoid
calling our classes 40, 41, 42 . . ., etc. For 40 might
include 39.95 to 40.95, or 39.45 to 40.45, or 39.55 to 40.55.
If we write 40 +, 41 +, etc., the first of these is usually
intended. With ages, 5 +, 6 + . . . is quite clear, since
the 5 + class includes all children who are at, or who have
passed, their fifth birthday, but have not yet reached
their sixth.

No set rules can be laid down as to the choice of classes,
but the following additional hints will apply to most of
the tables which teachers or psychologists will need.

(1) The number of different classes is generally somewhere
between 6 and 15. Some tables contain more, few
less. When the total number of persons, N, is only 50 or
less, a small number of classes is generally sufficient; but
if N is nearer 500, a more detailed table with a larger
number of classes may be desirable.

(2) Pick out from the original list of measures the
highest and the lowest measure to be tabulated, and
deduce from them the total range of the distribution. If
the range is from 20 to 1, then 7 classes containing 3 different
measures each are likely to be appropriate. If the
range is from 137 to 65, then 15 classes of 5 are a possibility.
In most instances all the classes should contain the same
number of different measures, i.e. they should be of the
same size.¹

(3) It is desirable, though not essential, to include an
odd number of different measures in each class. The
reason for this is that in any subsequent calculations we
are going to regard all the measures grouped within any
one class as lying at the midpoint of that class. If the
classes include, say, 3-7, 8-12, 13-17, 18-22, etc., then we
shall count all the 3's, 4's, 5's, 6's, and 7's as 5's, and all
from 8 to 12 as 10's. The midpoint of a class is found by
taking the average of the highest and lowest measures it
can contain; e.g. (3 + 7) ÷ 2 = 5. The same figure is
the midpoint of a class whose true limits are 2½ to 7½.
But a class such as 39.95 to 40.95 has the rather inconvenient
midpoint of 40.45.

(4) With school marks it is advisable to arrange for any
round numbers, such as the 5's and 10's, to fall at the
centres of the classes. Teachers are very apt to award
these round numbers and to neglect intermediate ones.
Hence, if classes of 10-14, 15-19, 20-24 . . . are chosen,
most of the frequencies may actually lie at 10, 15, 20 . . .,
and it would be illegitimate to regard them as lying at the
midpoints of 12, 17, 22. . . . The difficulty will be surmounted
if classes of 8-12, 13-17, 18-22 . . . are used
instead.

(5) Having chosen the classes and written them out in
a column, tabulate the F’s by the ‘tally’ method, as
before, and check the total, N.

¹ If the distribution is very far from symmetrical, as in the case
of incomes, school attendances, and the like, a table with unequal-sized
classes may better portray the state of affairs. But no further
calculations can be carried out with such a table.
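
The hints above on equal-sized classes and their midpoints can be
sketched in Python as follows. The function name
group_into_classes and the list of marks are hypothetical
illustrations, not taken from the text; the classes 8-12, 13-17,
18-22 are those suggested in hint (4).

```python
def group_into_classes(measures, lowest, size):
    """Group whole-number measures into equal-sized classes.

    Each class contains `size` different measures and the first class
    starts at `lowest`. Returns rows (low, high, midpoint, F), with
    the highest class first.
    """
    if min(measures) < lowest:
        raise ValueError("the lowest class must include the lowest measure")
    n_classes = (max(measures) - lowest) // size + 1
    freqs = [0] * n_classes
    for m in measures:                       # tally each measure into its class
        freqs[(m - lowest) // size] += 1
    rows = []
    for i in reversed(range(n_classes)):
        low = lowest + i * size
        high = low + size - 1
        midpoint = (low + high) / 2          # average of highest and lowest
        rows.append((low, high, midpoint, freqs[i]))
    assert sum(row[3] for row in rows) == len(measures)   # check against N
    return rows


marks = [10, 15, 12, 20, 15, 9, 17, 10, 20, 14, 8, 15]   # hypothetical marks
for low, high, midpoint, f in group_into_classes(marks, lowest=8, size=5):
    print(f"{low:>2}-{high:<2}  midpoint {midpoint:>4}  F = {f}")
```

With an odd class size the midpoints are whole numbers, and here the
round marks 10, 15, and 20 fall at the centres of their classes.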


Ex. 2.—Below is the distribution of the measures already
listed in Ex. 1. They have been grouped into classes of 3
marks each.


Classes F 


GRAPHICAL REPRESENTATION OF DISTRIBUTIONS


A distribution may often be portrayed more clearly by
means of a graph, either of the type known as a histogram,
or by a frequency polygon.¹ In both of these the frequencies
are plotted on the vertical (Y) axis against the
measures, or classes of measures, on the horizontal (X) axis.
The scale of the graph and the location of the origin (the
point of intersection of the axes) are purely matters of
convenience. Suppose, for example, that our graph paper
measures 6 inches × 9 inches, each inch being divided into
tenths, and that we wish to plot the distribution given in
Ex. 2. We might use a scale of 3 persons to the inch on
the Y axis, and one class (3 marks) to the inch on the X
axis. This has been done in Figs. 1-3. Fig. 4 portrays
two distributions of intelligence test scores which have
been grouped into classes 73-77, 78-82 . . . 133-137.
Each inch on the Y axis represents 5 persons, and each
inch on the X axis includes 2 classes (10 points) of score.

¹ The ogive type of graph, or cumulative frequency curve, will be
omitted here. The reader may find accounts of its construction and
uses in more advanced textbooks, such as Garrett (1937) or Dawson
(1933).

Fig. 1.—Histogram of distribution from Ex. 2.

A completed graph generally looks best if its largest 
vertical and horizontal dimensions are roughly the same. 
Its appearance will be too peaked if the scale of frequencies 
is too large, or the scale of measures too small, and too flat 
if the converse holds. 

The origin is often placed at X = 0, but there is no need
for this. Thus in Figs. 2-3, it is at X = -2, in order
that the Y axis may be drawn at the left-hand edge of the
graph paper. And in Fig. 4 it is at X = 100, in order that
the Y axis may be in the centre of the page.

The Histogram or Column Diagram.—Fig. 1 is a histogram
portraying the distribution given in Ex. 2. It consists
of a series of pillars or columns drawn side by side.
Each pillar represents one measure or, as in this instance,
one class of measures, and its height represents the frequency
of this measure or class. The marks from 0 to 20
are written along the X axis, but note that each mark is
represented by a space of ⅓ of an inch on this axis, not by
one of the inch or tenth-inch lines. The first class (including
0, 1, and 2 marks) is, therefore, represented by the
whole width of the first inch, the second class (3, 4, and 5
marks) by the whole width of the second inch, and so on.

Fig. 2.—Incomplete frequency polygon of distribution from Ex. 2.

The Frequency Polygon.—All the measures included in a
class are here represented by a point on the graph, whose
Y co-ordinate is the frequency, and whose X co-ordinate
is the midpoint of the class. Thus, in Fig. 2, which represents
the same distribution as before, the midpoints are
written along the X axis, and this time they are located
either at an inch or tenth-inch line. The points on the
graph are connected up by straight lines. This figure is,
as it were, suspended in mid-air, hence it is usual to complete
it (Fig. 3) by connecting the first and last points
with additional points whose Y co-ordinates are zero.
The X co-ordinates of these additional points are the
midpoints of the classes just below the lowest, and just
above the highest classes, in the distribution, i.e. X = -2
and X = 22.

Fig. 3.—Completed frequency polygon of distribution from Ex. 2.
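
The completion of a frequency polygon may be sketched in Python as
follows. The function name polygon_points and the frequencies are
hypothetical; only the class midpoints (1, 4, 7 . . . 19, for
classes of 3 marks) follow the description in the text.

```python
def polygon_points(midpoints, freqs):
    """Return the (X, Y) points of a completed frequency polygon.

    `midpoints` are the class midpoints in ascending order, assumed
    to be equally spaced; `freqs` are the corresponding frequencies.
    """
    if len(midpoints) != len(freqs) or len(midpoints) < 2:
        raise ValueError("need at least two classes, each with a frequency")
    size = midpoints[1] - midpoints[0]          # class size
    points = list(zip(midpoints, freqs))
    first = (midpoints[0] - size, 0)            # class just below the lowest
    last = (midpoints[-1] + size, 0)            # class just above the highest
    return [first] + points + [last]


midpoints = [1, 4, 7, 10, 13, 16, 19]    # classes 0-2, 3-5 . . . 18-20
freqs = [1, 2, 4, 6, 5, 3, 1]            # hypothetical frequencies
points = polygon_points(midpoints, freqs)
print(points[0], points[-1])             # (-2, 0) (22, 0)
```

The two added points have zero frequency, so the polygon is brought
down to the X axis at both ends.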
Ex. 3.—For further illustration there follow the frequency
distributions of scores obtained by 136 men and 114 women
students on the Cattell Intelligence Test, Scale IIIA. The
scores are grouped into classes of 5. The distributions for the
two sexes are plotted as frequency polygons on the same graph
(Fig. 4), and for the sexes combined as a histogram (Fig. 5).
Note in the latter that the scale along the Y axis has been
halved, since combining the sexes roughly doubles the frequencies
to be plotted.


                       Frequencies
Test Scores     Men    Women    Combined

   133 +          1       0         1
   128 +          0       1         1
   123 +          5       1         6
   118 +          8       3        11
   113 +         13       9        22
   108 +         17       9        26
   103 +         21      14        35
    98 +         17      30        47
    93 +         23      20        43
    88 +         14      13        27
    83 +          6      12        18
    78 +          8       2        10
    73 +          3       0         3
     N =        136     114       250


Relative Merits of Frequency Polygons and Histograms.—
Frequency polygons possess the advantage over histograms
that two or more distributions can be superimposed
and compared on one graph, as in Fig. 4. But they are
inferior in that the lines connecting successive points give
a somewhat inaccurate notion of the distribution, especially
when the number of points is small. For instance, in
Fig. 4 the connecting line between the two points X = 110,
F = 17, and X = 115, F = 13, suggests that 15 men
students obtained intermediate scores of 112½, which is
quite untrue. In this respect the histogram is unexceptionable,
since it gives an exact portrayal of the tabulated data.
The total area enclosed within a histogram corresponds
accurately to the total number of persons, N. The area
within a polygon only corresponds approximately. Histograms
have a further advantage, namely, that they can be
employed when the variable is not a truly quantitative one.
For example, examination scripts are often graded, not
with numerical, but with letter, marks: A, A-, B+, etc.
The frequency distribution of the numbers of examinees
who are awarded each letter can be shown by a histogram.
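
The misleading impression given by the connecting line in a
polygon is simply a matter of linear interpolation, as the following
sketch in Python shows. The function name interpolate is an
assumption made for illustration; the two points are those quoted
above for the men's scores in Fig. 4.

```python
def interpolate(x0, f0, x1, f1, x):
    """Frequency read off the straight line joining (x0, f0) and (x1, f1)."""
    return f0 + (f1 - f0) * (x - x0) / (x1 - x0)


# The line joining X = 110, F = 17 and X = 115, F = 13 passes through
# F = 15 at X = 112.5, although no such frequency was tabulated.
print(interpolate(110, 17, 115, 13, 112.5))   # 15.0
```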


Fig. 4.—Frequency polygons of distributions from Ex. 3: men and women.


CHARACTERISTICS OF FREQUENCY DISTRIBUTIONS

Smoothness of Distributions.—Here we must distinguish
provisionally between ‘objective’ and ‘subjective’ measures,
and discuss the difference more thoroughly in the next
chapter. Objective measures are those which are practically
independent of the person who makes them, such as
heights measured with a ruler, or the scores on a standardized
educational or intelligence test. Subjective measures,
by contrast, depend upon personal evaluations. They
include, for example, the marks awarded by a teacher to
a series of English compositions, assessments of pupils'
characters, and the like.

Now when graphs are drawn of frequency distributions
of objective measures, certain characteristic features are
very commonly found. First, the graphs are generally
very irregular when the frequencies are small, but the
irregularities tend to become ‘ironed out’ as the numbers
increase. And when N reaches a value of several hundreds,
the frequency polygon or histogram approximates to a
smooth curve. Thus the distribution in Fig. 5 is much
more regular in its rise and fall than is the distribution in
Fig. 6, which represents the scores of a group of 50 students,
selected at random from the previous 250.

Fig. 5.—Histogram of distribution of men’s and women’s scores
combined, from Ex. 3.

Fig. 7, which is a frequency polygon of the heights of
6,194 male adults, gives an almost smooth curve.

Fig. 6.—Distribution of test scores of a small group.

Fig. 7.—Frequency polygon of heights of 6,194 English male adults.

Fig. 8.—Irregular distribution of test scores grouped in small classes.

Fig. 9.—More regular distribution of test scores grouped in large classes.

Frequencies may be increased not only by increasing the
value of N, but also by grouping the measures into larger
classes. Figs. 8-9 show frequency polygons for the same
set of 250 measures grouped into classes of 3 and 10 respectively.
The latter is distinctly smoother than the former.
Types of Distribution Curves.—In actual practice we can
never obtain measures of a large enough number of persons
for the distribution graphs to become completely smooth
curves. Nevertheless, it is customary to represent various
types of distributions pictorially by means of curves of
appropriate shapes, to which obtained distributions approximate
more or less closely. For example, we can say
that Fig. 10 represents roughly the type of distribution which
might be expected for the incomes of parents of elementary
school children. The greatest number would probably lie
between £100 and £180 per annum. Very few would
obtain less than £80, and the frequencies of those at higher
income levels would progressively diminish. This particular
type of distribution is called a skew curve, because of its
asymmetry. It is positively skewed towards the upper end,
and shows a piling up of cases at the lower end. A bunching
at the upper end would give a negatively skewed curve.

Fig. 10.—Skew curve, illustrating approximate distribution of annual
income of parents of elementary school children.

Fig. 11 is a bimodal curve such as might be obtained if
we graphed the intelligence quotients ¹ of pupils in two
schools, one containing very intelligent children of professional
class parents, the other containing inferior children
of very poor stock. Most of the I.Q.s fall either between
110 and 140 or between 70 and 90, but there are a few of
even higher or even lower intelligence, and a few (probably
drawn from both schools) of intermediate or average
intelligence, with I.Q.s of 90 to 110.

¹ It is assumed that all readers know roughly what is meant by
intelligence quotients (I.Q.s). A definition may be found on p. 22,
and a more precise account of their meaning on p. 217.

Fig. 11.—Bimodal curve, illustrating intelligence quotients of mixed
groups.

The two polygons which appear in Fig. 4 are irregular on
account of the small frequencies involved. They approximate,
however, to a pair of curves such as those
shown in Fig. 12. Note here that the more highly peaked
curve (representing women’s scores) does not indicate that
women obtain, on the whole, higher scores than men, but
that women obtain more scores close to the average of
100 points, and fewer extreme scores; in other words, that
men are more dispersed or spread out in their ability.

It should be remembered that the areas enclosed beneath
distribution curves are proportional to the total numbers of
cases involved. In Fig. 12 the areas are roughly equal, since
the total numbers of the two sexes are about the same.
Fig. 13 shows the curves to which the distributions of scores
among Honours and among ordinary degree students
approximate. The latter have, on the whole, somewhat lower
scores than, but are four times as numerous as, the former.
It is more usual, when groups of such different sizes are
being compared, to plot percentage frequencies instead of
actual frequencies; since the curves will then possess the
same area, and differences in the averages and ranges or
dispersions of measures will stand out better.

FIG. 12.—Two distributions of intelligence test scores, with different dispersions.
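
The conversion to percentage frequencies can be sketched in a few lines of Python. The score bands and counts below are hypothetical (they are not taken from Fig. 13); they merely imitate two groups, one four times as numerous as the other, so that each distribution sums to 100 and the two can be compared on the same footing.

```python
def percentage_frequencies(frequencies):
    """Express each class frequency as a percentage of the group total."""
    total = sum(frequencies)
    return [100 * f / total for f in frequencies]


# Hypothetical frequencies in six successive score bands (lowest first).
honours = [0, 2, 6, 12, 6, 2]      # 28 students
ordinary = [8, 24, 48, 24, 8, 0]   # 112 students, four times as many

honours_pct = percentage_frequencies(honours)
ordinary_pct = percentage_frequencies(ordinary)

for band, (h, o) in enumerate(zip(honours_pct, ordinary_pct), start=1):
    print(f"band {band}: Honours {h:5.1f}%   ordinary {o:5.1f}%")
```

With these invented figures the two percentage curves have the same height at their peaks and differ only in position along the scale, which is the comparison the percentage plot is meant to bring out.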


FIG. 13.—Distributions of intelligence test scores of two groups, one much larger than the other but of lower average intelligence.


THE NORMAL FREQUENCY DISTRIBUTION


Whenever a large group of human beings is taken at
random (i.e. not specially selected like the two schools
whose I.Q.s are graphed in Fig. 11), then objective measures
of almost any trait or ability they possess—physical
or mental—will be found to conform to a certain type of 
distribution known as the normal curve. This is often 
referred to as the normal probability curve, the normal 
curve of error, or the Gaussian curve. It is symmetrical, 
bell- or cocked-hat-shaped. It indicates that the majority 
of persons in the group obtain measures round about the 
average, relatively few obtain extreme measures, and that 
the frequencies above the average are the same as the 
frequencies below it. The four curves in Figs. 12-13 belong 
to this type, and all the graphs from Fig. 4 to Fig. 9 roughly 
approximate to it, since they represent distributions of 
scores obtained by unselected groups of persons. Innumerable
instances of distributions which tend to this normal
shape could be cited: the heights of children in an
ordinary school class, the numbers of times they can tap
with one finger in a minute, or the numbers of words they
can write. Variations in individual performances as well
as differences between individuals show the same trend:
if a person tries a hundred times to copy a line of a certain
length, most of his attempts will be close to his own average,
and a few will be considerably too long or too short.
Differences between groups are also similar: if all the
elementary school classes aged 11 + in a large city are
given an objective arithmetic test, and the average mark
of each is calculated, then a graph of all these averages
may be expected to show a normal distribution.
Conditions which Upset the Tendency to Normality.— 
There are several factors which may disturb this normal 
tendency. First, the total frequencies involved may be 
so small that the irregularities, noted above, may mask it. 
It is usually apparent, however, even with groups of 50 or 
less (cf. Fig. 6). Secondly, the group of persons may be 
selected in such a way as to distort an otherwise normally 
distributed variable. For example, a school class which
has been ‘creamed’ will be likely to show a negatively
skewed distribution of test scores or examination marks,
since there will be a deficiency of pupils of high ability.
The police force or the army will not show a normal dis-
tribution of heights, since no recruits are accepted below
a certain height.
Thirdly, there are a few human characteristics which,
although objectively measurable, are known to be abnormally
distributed. Accident proneness is one such variable.
It is found that a few people are liable to undergo a large
number of accidents in any given period of time, whether at
home, or in factories, or when driving cars. But the majority
have few or no accidents.¹ Thus the distribution takes roughly
the form shown in Fig. 14. Annual income also gives a skew
distribution (cf. Fig. 10). There is good evidence that
intelligence quotients do not conform precisely to the
normal curve.² A slight deficiency occurs in the frequencies
of I.Q.s of 100-120, and a slight excess in frequencies of
I.Q.s around 90 and 130 or over. Nevertheless, the
discrepancies are so small that for many practical purposes
we may continue to regard the distribution of intelligence,
when measured by an adequate test, as normal.

FIG. 14.—Approximate frequency distribution of accident proneness (horizontal axis: number of accidents in a certain period).

A fourth condition (which is much commoner than the
third) is the presence of defects in the method of measuring
the variable. We shall see in the next chapter how such
defects arise in the testing and examining of abilities, and
how they should be dealt with in the light of our knowledge
of the characteristics of frequency distributions.

¹ Cf. H. M. Vernon (1936).

² This result was definitely established by the recent Scottish survey
of a truly unselected sample of children (cf. Macmeeken, 1939).
For a discussion of the grounds for such asymmetry, cf. Burt (1937).
CHAPTER II


PRINCIPLES OF MENTAL MEASUREMENT AND 
MARKING 


We are all accustomed nowadays to regarding a scholastic 
capacity (e.g. arithmetical ability), like height or weight, 
as ranging from a very small amount in some people to a 
very large amount in others, and to assessing these amounts 
in numerical units such as percentage marks. General
intelligence and other aptitudes can similarly be graded as
quantitative variables, average intelligence being generally
denoted by an Intelligence Quotient or I.Q. of 100, superior
or inferior intelligence by larger or smaller numbers. Yet
the conception of scaling mental qualities in a manner
similar to physical ones is of fairly recent origin—we owe
it largely to the work of Sir Francis Galton—and it is still
somewhat limited in application and beset with many
difficulties. A human ability or trait is far more complex
than a length, a weight, or a time; it can never be com[...]
such a scale can hardly [...] of political opinion; it
inevitably involves some over-simplification and
artificiality. Particularly in the field of temperament and
character does such distortion arise, for here people appear
to us to vary from one another, not merely in amount, but
also in kind or quality.¹ The same is true, though to a
lesser extent, in the field of abilities. Thus the precise
nature of intelligence is an extremely controversial matter,
and the common tendency to regard the I.Q. as an accurate
measure of some definite and stable mental entity is, as we
shall see below, an egregious misinterpretation.

¹ For a fuller discussion of these points, cf. Vernon (19380).

No mental trait or ability can be measured directly by 
laying some kind of ruler alongside it. Instead we have 
to grade it by observing its expression in words or actions. 
Arithmetical ability, for example, is shown by success at
a series of sums; moral character by certain modes of
behaviour. Indeed, for scientific purposes, we have to
regard a trait or ability, not as some power or faculty in
the mind, but as a particular class or category of related
performances, e.g. of the arithmetical or the moral kind.
And special statistical techniques are required for deciding
just what performances should be included under one
heading, as representative of one ability or trait (cf. Chap.
VIII). Since we cannot, of course, observe and record all
the performances which go to make up a mental
characteristic, we are forced to take samples of them. Our
procedure, as Binet pointed out, is analogous to that of the
mining engineer who cannot examine all the ground in a
given locality, but instead takes borings at various points,
and from these samples deduces with reasonable accuracy
what the district as a whole is like. Similarly, the
psychologist measures intelligence, or the teacher measures
achievement in arithmetic, on the basis of the child’s
success at sample intellectual and arithmetical tasks. An
arithmetic examination sets, perhaps, six questions, the
answers to which constitute a sampling of the general field
of arithmetical knowledge which the pupil is supposed to
have acquired.

How, then, is success at such sample performances
expressed in numerical terms? The various methods of
marking or scoring mental tests and examinations may be
classified as follows:

A. Enumeration or count scoring.
B. Ranking.
C. Qualitative grading and classifying.
D. Numerical evaluation.
E. Mixed numerical evaluation and count scoring.


A. ENUMERATION OR COUNT SCORING


This type of marking is the only objective type, in the
sense that each person's mark is quite independent of the
marker's subjective evaluation of his work. When, for
example, a series of simple arithmetic sums is set, the
answers to which are all either right or wrong, or when one
[...] which are counted up, it is largely restricted to very
simple kinds of scholastic work. We shall see later, however
(Chap. XII), that attainments in more advanced and
complex school or university work can be fairly adequately
[...] general intelligence tests follow this method, so as to
eliminate the personal element in scoring. Steel and [...]
vocabulary, sentence-structure, and sentence-linki[...]
they contain. The scoring [...] are known as performance
tests [...] aptitudes. The child is set s[...] task, and a
record is obtai[...] seconds taken, [...]

This type of marking includes two distinct methods of
measurement which may be termed the rate method and
[...] [...]ination may contain a [...]
from quite simple to very hard tasks, such that all persons
being tested can answer some of them, and scarcely any
or none can manage all of them. Here no time limit is
necessary. The score is again based on enumerating the
tasks successfully completed, since those whose ability,
knowledge, or ‘power’ is greater are able to answer tasks
of a higher level of difficulty. Greene and Jorgenson
(1936) aptly compare these two methods to a hurdle race
and a high-jump competition respectively. The table of
tests given in Chap. IX lists several examples of both.
Thus, Ballard's One-minute Reading and Arithmetic Tests
and Burt's Reading Fluency Test belong to the former
type. Burt’s Graded Vocabulary Tests of Reading and
Spelling and the Stanford-Binet Intelligence Test belong
to the latter. Most tests and examinations, however,
combine the two, i.e. they include easy and difficult items,
but also impose a time limit, so that the score depends on
power and rate of working.

Comparison of Mental Count Scores with Physical 
Measures.—We must ask now whether such count scores 
can legitimately be regarded as measures of mental abili- 
ties, analogous, say, to measures of height and weight or of 
physical abilities such as rate of running, power of muscular
contraction, and the like. Actually the numbers in terms
of which abilities are expressed show several important
differences from numbers such as inches on a ruler or
pounds on a weighing machine. The latter, physical,
units are all equal to one another at any point on the scale.
E.g. the difference between 50 and 49 inches is precisely
the same as the difference between 40 and 39 inches.
Moreover, they always have a zero point: e.g. 0 lb.
means no weight at all. Mental ability units often lack
these features, and this fact precludes us from adding,
averaging, or otherwise treating them as if they were true
numbers.

For example, a one-year-old child might fail to pass any
of the Stanford-Binet Intelligence Tests, but his zero score
shows, not that he possesses zero intelligence, but that the
tests are an inadequate measuring instrument for this
purpose. Again, a child with an I.Q. of 120 cannot be
considered twice as intelligent as one with an I.Q. of 60,
because we do not know what is the zero point of the I.Q.
scale. Even in a rate test of, say, reading speed, a child
who reads 50 simple words in a minute, and so scores twice
as much as one who reads 25 words in the same time,
cannot properly be said to be twice as good at reading.
It is obvious that the units in a rate test, such as words
read, can never be precisely uniform in difficulty. Similarly,
the various errors in a dictation test may vary considerably
in seriousness. It should be noted also that two children
who obtain the same score on a rate test do not necessarily
accomplish precisely the same items, since one may fail
on, or omit, certain items which the other passes; and
they may each make the same number of mistakes in the
spelling of a certain passage, many of which are different.
Our tests, then, might be compared to a weighing machine
with a large number of pound weights, some of which are
slightly heavier, some slightly lighter, than a pound.
When using the machine to weigh children we do not always
pick out precisely the same set of weights. Now, in spite
of these defects in the weights, we can claim that two
[...] the same amount. For [...]-Binet Scale (from Year
[...] supposed to represent an
increment of two months of mental age.¹ But it is generally
admitted nowadays that mental age units are not
equivalent, and that the difference between a child of
M.A. 13½ and one of 13 is probably much smaller
than the difference between children of 6½ and 6.

Here, then, it is as if we ran out of (nominal) pound
weights after a certain point, and had to use further
weights which progressively increased in heaviness.
Obviously this leads to grave difficulties in the mental age
method of scoring (cf. Chap. IV). Several group
intelligence tests, however, do manage to maintain
approximately equal increments throughout.
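
The point about the missing zero can be illustrated with a small Python sketch. The candidate zero points below are purely hypothetical, chosen only to show that a ratio between two scores depends entirely on where the true zero of the scale is assumed to lie, whereas the difference between them does not.

```python
def ratio_from_zero(a, b, zero_point):
    """Ratio of two scores when the scale's true zero lies at `zero_point`."""
    return (a - zero_point) / (b - zero_point)


high, low = 120, 60   # the two I.Q.s used in the text

# Hypothetical locations of the "real" zero of the I.Q. scale.
for zero in (0, 20, 40):
    print(f"zero at {zero:2d}: ratio = {ratio_from_zero(high, low, zero):.1f}, "
          f"difference = {high - low}")
```

Only if the zero really stood at 0 would the first child score "twice" the second; with the zero at 20 or 40 the same pair of scores gives ratios of 2.5 or 4, so no ratio statement can be justified while the zero point is unknown.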

The units of measurement of many physical skills present 
similar problems, as Blackburn (1936) has pointed out.
That equal increments of score do not necessarily represent
equal increments of skill can be seen from the example of
golf scores. It is decidedly easier for a player who requires
140 strokes for a round to improve to 130, than it is for a
player who takes 70 to improve to 60 (or even—if we could 
assume golf scores to follow a geometrical rather than an 
arithmetical progression—to 65). Here, also, it is impos- 
sible to establish a zero point of ability, or to calculate 
ratios between abilities. 
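
The arithmetic behind the parenthesis can be checked directly. The sketch below assumes, as the text does hypothetically, that golf scores follow a geometrical progression, so that "equal" improvements are equal proportional reductions; the function name is an illustrative choice, not an established formula.

```python
def equivalent_target(start, ref_start, ref_target):
    """Score that reduces `start` in the same proportion as ref_start -> ref_target."""
    return start * ref_target / ref_start


# A 140-stroke player improving to 130 cuts his score to 130/140 of its value.
print(equivalent_target(70, 140, 130))   # the same proportional cut applied to 70
```

On this assumption the 70-stroke player's matching improvement is to 65 strokes, not 60, which is the figure the text gives; the text's claim is that even this smaller step is harder to achieve.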

Special techniques have been devised by Thorndike,
Thurstone, and others for the production of measuring
scales with equal units, and for determining the zero points
on such scales. They are too complex for description here.
But they depend in the main upon the principle of normal
distribution of objective measures, which was enunciated
in Chap. I. Well-constructed tests always tend to yield a
normal frequency distribution; hence we are entitled to
regard departures from normality as indicative of bad
construction or bad scoring. For example, a mental test
which contains plenty of easy and moderately difficult
items, but whose later items increase too rapidly in
difficulty to allow sufficient headroom, will give a negatively
skewed distribution. The analogy with a weighing
machine would be as follows: suppose that the available
weights beyond the seventieth became progressively bigger
than 1 lb., and that a large group of children were measured
whose true weights averaged about 70 lb., and ranged from
50 to 90 lb. The distribution would then be distorted from
normal to negative skew type, as in Fig. 15. Conversely,
a test which contains insufficient easy items, so that it only
differentiates effectively among the better pupils, is
likely to show a positively skewed distribution.

¹ For a definition and discussion of mental age, cf. pp. 82-87.
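
The weighing-machine analogy can be imitated numerically. In the sketch below the true weights, their frequencies, and the rule by which readings above 70 are compressed are all hypothetical simplifications (the units above the threshold are taken as uniformly twice too large, rather than progressively larger), and skewness is measured by the usual moment coefficient, which the text itself does not define.

```python
def skewness(values):
    """Moment coefficient of skewness: third central moment / sd cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def measured_weight(true_weight, threshold=70, shrink=0.5):
    """Reading from a machine whose units above `threshold` are too large.

    Up to the threshold the reading is exact; beyond it each nominal unit
    covers 1/shrink true pounds, so the upper readings are squeezed together.
    """
    if true_weight <= threshold:
        return true_weight
    return threshold + (true_weight - threshold) * shrink


# Hypothetical symmetrical set of true weights (lb.), centred on 70.
weights = [50, 55, 60, 65, 70, 75, 80, 85, 90]
counts = [1, 2, 4, 7, 10, 7, 4, 2, 1]
true_weights = [w for w, c in zip(weights, counts) for _ in range(c)]
readings = [measured_weight(w) for w in true_weights]

print("skewness of true weights:", round(skewness(true_weights), 3))
print("skewness of readings:    ", round(skewness(readings), 3))
```

The true weights are symmetrical, so their skewness is zero; the readings bunch together at the upper end and the coefficient becomes negative, which is the distortion attributed above to a test with insufficient headroom.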

Application to School Marks and Examinations.—It is
difficult to apply this principle to school marks, since the
groups are usually so small (less than 50) that great
irregularities in distribution are sure to arise. The following
table, however, gives the marks obtained by two 11 +
classes in an ordinary school on an objective arithmetic
test. Owing to the deficiency in difficult questions, many
pupils at the top got full marks (80), and the real extent
of their ability was not tested at all.

FIG. 15.—Normal distribution (a) altered to negative skew curve (b) by increasing the size of the higher [...].


50 60 


ias dom abnormal frequency distribution of arithmetic 

Mark F 
50 15 

70 + 
60 + Е 
Sis. 14 
20 sr ‚ 17 
30 4 : 
Further problems occur in the interpretation of count
scores. Physical measures, such as 60 inches or 15
seconds, always represent definite amounts. No doubt
exists as to their largeness or smallness, because an inch
or a second has an absolute, standard size. Laymen and
teachers often wrongly suppose that the same is true of
mental measures, such as the mark awarded to a pupil's
work. They talk of 15/20 or 60% as if these numbers
implied the same level of achievement whenever, or by
whomsoever, they are awarded. Yet it is surely obvious that
such scores, by themselves, tell us nothing at all about the
goodness or poorness of achievement. Such statements
as: ‘Johnny is doing well at arithmetic. He got 8 sums
out of 10 correct,’ or ‘His dictation is terrible; he made
15 mistakes’—are quite meaningless unless we also know
the difficulty of the sums and of the passage dictated.
Johnny may be the poorest arithmetician in his class, and
yet score 8/10 if sufficiently simple sums are set. Or he may
be the best speller in his class, and yet make 15 mistakes if
the passage happens to be taken, say, from a textbook of
organic chemistry. The speaker is, of course, implying
that the tasks are of a level of difficulty such that most of
the class score less than 8 and 15 respectively. In other
words, the marks possess only a relative, not an absolute,
significance. Again, when an inspector or head teacher
complains that the average examination mark of a class is
‘too low,’ he is presumably comparing the mark and the
difficulty of the examination with subjective recollections of
the achievements of other similar classes. Undoubtedly
there is a great deal of loose thinking about these matters,
and much can be learnt from the more systematic procedure
of the scientific mental tester.

The tester realizes that the goodness or poorness of a
child's test performance can be interpreted only by
comparing it with the performances of other similar children.
Hence he applies his test to large numbers of children,
similar as to age and other relevant circumstances, scores
their performances all in the same way, and can then state
definitely whether a certain performance is superior or
inferior, and how much better or poorer it is than the
average. This procedure is called the establishment of
norms or standards for the test. A discussion of the main
types of test norms is given in Chap. IV, where it is also
shown how the adoption of analogous methods may greatly
help in the interpretation of school and examination marks.
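
How a norm group gives meaning to a raw score can be sketched as follows. The two sets of norms are invented for illustration (ten pupils each, one set for a very easy test and one for a hard test), and the two indices used, the percentage of the norm group scoring lower and the deviation from the average in standard-deviation units, are common choices assumed here rather than the particular norms described in Chap. IV.

```python
from statistics import mean, pstdev


def percent_below(score, norm_scores):
    """Percentage of the norm group whose score is lower than `score`."""
    return 100 * sum(s < score for s in norm_scores) / len(norm_scores)


def standard_score(score, norm_scores):
    """Distance of `score` from the norm average, in standard deviations."""
    return (score - mean(norm_scores)) / pstdev(norm_scores)


# Hypothetical marks out of 10 obtained by similar children.
easy_test_norms = [8, 9, 9, 9, 10, 10, 10, 10, 10, 10]
hard_test_norms = [2, 3, 4, 4, 5, 5, 6, 6, 7, 8]

for name, norms in (("easy test", easy_test_norms),
                    ("hard test", hard_test_norms)):
    print(f"8/10 on the {name}: {percent_below(8, norms):.0f}% scored lower, "
          f"standard score {standard_score(8, norms):+.2f}")
```

The same mark of 8 out of 10 is the lowest in the first group and the highest in the second, so it is only the comparison with the norm group that tells us whether Johnny is doing well.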


B. RANKING

Ranking means listing a group of pupils or students in
order of their attainment, 1st, 2nd, etc., down to bottom.
This method may be applied to any form of scholastic
work, including composition, handwriting, and the like,
which are scarcely amenable to count scoring. It is
actually the soundest of all forms of marking, in spite of its
many limitations, because these limitations are so obvious
and so easily recognized. The first drawback is that the
number of persons who can be ranked is rather small.
Even the arrangement of twenty in order involves a good
deal of uncertainty, especially near the middle of the list.
Those at the extremes are usually more readily
discriminated from one another. Next, it is clear that these
ranks do not provide a true numerical scale, since the
numbers 1, 2, 3, etc., are by no means equally spaced.
Usually the differences between the attainments of Nos. 1
and 2, 2 and 3, 18 and 19, 19 and 20 (out of 20) are greater
than the differences between Nos. 9 and 10, 10 and 11—hence
the difficulty, just mentioned, in ranking the medium pupils.
The frequency distribution is rectangular (Fig. 16), and is
entirely unlike the normal curve. These numbers,
therefore, may give quite false impressions if they are added,
averaged, or otherwise manipulated. For example, a
pupil who is first in one examination and third in another
should be regarded as better than, not as equal to, a pupil
who is second in both (cf. p. 70).¹

FIG. 16.—Rectangular frequency distribution given by a rank order.

A third main drawback is that these numbers are always
relative to the particular set of persons who have been
ranked. Thirtieth place is not an unequivocal number
like 30° on a thermometer. For 30th out of 32 is much
poorer than 30th out of 50. Moreover, if any one person
higher on the list should happen to be absent, the 30th
rank would change to 29th. It is not even analogous to,
say, 30 mistakes in a dictation test, since such a test can
be applied to large numbers of children, and norms can be
collected which will tell us how low is the level of ability
to which 30 mistakes corresponds. In contrast, a rank
position can never be so interpreted with reference to
outside standards.
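
The unequal spacing of ranks can be made concrete by converting each rank into the score it would correspond to on a normal curve. The sketch below is only one possible convention, assumed here for illustration (each rank is placed at the mid-point of its share of the group and looked up on the standard normal curve with Python's `statistics.NormalDist`); it is not necessarily the conversion given in the textbooks the author cites.

```python
from statistics import NormalDist


def rank_to_normal_score(rank, group_size):
    """Normal-curve score for a rank (1 = best), using a mid-point convention."""
    proportion_below = 1 - (rank - 0.5) / group_size
    return NormalDist().inv_cdf(proportion_below)


n = 20
scores = {r: rank_to_normal_score(r, n) for r in range(1, n + 1)}

print("gap between 1st and 2nd:  ", round(scores[1] - scores[2], 2))
print("gap between 10th and 11th:", round(scores[10] - scores[11], 2))
print("first and third, averaged:", round((scores[1] + scores[3]) / 2, 2))
print("second in both:           ", round(scores[2], 2))
```

Under this assumption the step from 1st to 2nd is several times larger than the step from 10th to 11th, and the pupil who is first and third comes out ahead of the pupil who is second twice, in agreement with the argument above.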


C. QUALITATIVE GRADING AND CLASSIFYING 
Examination scripts, essays, etc., are often merely
classified into some three to ten grades which are
distinguished by letters (A, A −, B +, etc.; α, β, etc.) or by
names (Credit, Pass, Fail; 1st class, 2nd class, etc.; Excellent,
Very Good, Good, etc.). This is a more primitive type of
grading than numerical evaluation (e.g. on a 0 to 10, or a
percentage, scale), though it is of course often applied to
scripts which have already been awarded percentage
marks. For instance, examinees whose mark is 80% or
more are informed that they have obtained First Class
or Credit, and so on.

¹ The numbers can, however, be converted into a normally
distributed set of scores (cf. Garrett's or Guilford's books). And it is
possible, without any conversion, to correlate one rank order with
another, e.g. to compare a teacher's ranking of school work with the
pupils’ results on a group intelligence test (cf. Chap. VI).

Since these grades make no pretence of being numerical
measures, we need not cavil at the eccentricity of their
frequency distributions. They are, however, liable to
another, still more serious, defect which is not found
among count scores or ranks, namely that different
teachers or examiners often employ widely different scales
or standards of grading. Although they are supposed to
represent absolute evaluations of achievement, they are
actually found by experimental investigation to involve
such gross discrepancies that neither pupils, parents, nor
teachers can hope to interpret their significance correctly.
For example, Hartog and Rhodes (1935) arranged for 48
School Certificate French scripts to be graded
independently by six experienced examiners. One of the
examiners awarded 46 credits, 2 passes, and 0 failures.
Another gave 17 credits, 12 passes, and 19 failures. The
remaining four were intermediate. Similarly, Boyd (1924)
had 26 essays by 11 + children graded by 271 teachers as
either Excellent, VG +, VG, G +, G, Moderate, or
Unsatisfactory. Some of the teachers awarded one or other
of the two highest grades to as many as 15 of the essays,
others to none of them. Some also awarded one or other of
the two lowest grades to 17 of the essays, others to none of them.
In view of such results, it is clear that there is no agreed
meaning as to the absolute amount of ability which these
gradings should signify. A mark of Excellent may
represent a high level of ability if it is awarded by a teacher
who very seldom gives Excellents. It may mean little
better than average ability if it is awarded by a teacher
who habitually grades nearly half his papers as Excellent.
In other words, this type of marking is actually a relative
one, like ranking. A mark of G +, or B −, or Pass, etc.,
only achieves any meaning if we know what proportions
of candidates obtained better or poorer marks, in just the
same way as a rank position of 30th can only be interpreted
if we know how many candidates were ranked lower than
30th. Qualitative grading should then be regarded as a form
of ranking where, instead of there being as many different
ranks as there are pupils or examinees, there are only
some 3 to 10 ranks, any one of which may be awarded to
several pupils whose work is of approximately equal merit.
Its grave defect is that any one script is not usually
compared directly with the scripts of a particular group of
pupils before receiving its grade; it is merely compared
with the marker's more or less vague recollections of the
scripts which he has marked in the past. The only possible
rational meaning for Excellent or First Class is that the
script which receives this grade seems to him to rank
higher, or to be relatively better, than almost all the
scripts produced by similar pupils which he has come
across. The other grades are similarly referred to rough
subjective standards which he has set up during his
previous marking experience. And since his experience
naturally differs from all other markers’ experiences, so will
his standards differ, unless he has taken the trouble to
discuss and compare them with others, and to try to adjust
them accordingly.

Finally, as in the case of rank orders, there is no
possibility of exact comparison of such grades with external
standards or norms, and they cannot be added or averaged
in any satisfactory way.¹
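
The relative character of such grades can be shown with the two extreme examiners reported by Hartog and Rhodes (46 credits, 2 passes, 0 failures against 17 credits, 12 passes, 19 failures, each on the same 48 scripts). The short Python sketch below, in which the labels "lenient" and "severe" and the function name are illustrative choices, works out what proportion of the scripts each examiner placed below a Credit.

```python
GRADES = ["fail", "pass", "credit"]   # lowest to highest


def percent_graded_lower(counts, grade):
    """Percentage of scripts to which an examiner gave a lower grade than `grade`."""
    total = sum(counts.values())
    lower = sum(counts[g] for g in GRADES[:GRADES.index(grade)])
    return 100 * lower / total


# Awards of the two extreme examiners on the same 48 French scripts.
lenient = {"credit": 46, "pass": 2, "fail": 0}
severe = {"credit": 17, "pass": 12, "fail": 19}

for name, counts in (("lenient", lenient), ("severe", severe)):
    print(f"{name} examiner: a Credit stands above "
          f"{percent_graded_lower(counts, 'credit'):.1f}% of the scripts")
```

A Credit from the first examiner stands above only about 4 per cent. of the scripts, while a Credit from the second stands above nearly 65 per cent., so the same label carries quite different information according to who awards it.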


D. NUMERICAL EVALUATION
and
E. MIXED NUMERICAL EVALUATION AND COUNT SCORING


Numerical evaluation consists, nominally, in assigning
absolute numerical grades to pupils’ abilities, school work,
or examination scripts, either on a 0-10, 0-20, percentage,
or some other scale. It exhibits all the ambiguities of the
qualitative grading just described, and possesses several
additional defects of its own. In other words, although it
is the most commonly used, it is quite the worst type of
marking. Unfortunately, since it involves numbers, it is
frequently confused with the count scores obtained from
objective marking (Type A). Teachers assume that the
label of, say, 15/20 which they award to an English
composition is equivalent to a score of 15 correct arithmetic
sums out of 20. Often the two become inextricably mixed.
For instance, partial credits involving subjective judgments
of merit may be awarded to partially right sums. Or an
examination in history, science, languages, etc., may
include, say, five questions each of which is marked
subjectively out of 20, the marks then being added as if they
were count scores to yield percentage totals. Or some
analytic plan of marking may be adopted whereby one
mark is counted for each correct fact or idea, and so many
are added for subjective impressions of style or general
merit.

¹ When the group of pupils is a large one, their grades can be
converted into rational scores; cf. Dawson (1933), pp. 91-3.

Distributions of such marks are sometimes smooth and 
symmetrical, but are often wildly irregular or skewed. 
In an examination where a pass mark of, say, 50% has
been fixed, it is common to find very large numbers of
50's, 51's, and 52's, but no 45's to 49's. Some teachers
show peculiar idiosyncrasies, such as avoiding certain
marks altogether. For instance, they may award 9, 8, 6,
5, or 3 out of 10 very often, but scarcely ever give 10, 7, 4,
2, or 1. Two illustrations may be given of distributions
among students or pupils who also took an examination
which was objectively scored.

Ex. 5.—Distributions of (Subjective) Marks for Reading and
(Objective) Marks for Geography.

Mark     Reading F     Geography F
 10         24              2
  9         25              3
  8         18              8
  7          6             10
  6          4             11
  5          2              9
  4          0              9
  3          0              7
  2          0             10
  1          0              4
  0          0              1
Total       74             74

The reading marks in Ex. 5 are fairly typical of the
distribution adopted by teachers in evaluating this subject,
or composition, or handwriting. Note that this teacher
was not, as she supposed, marking out of 10, but out of 6,
since she only used 6 different grades. The fact that about
one-third of the class get 10 implies that the work of all
these children was equally good, which is certainly not
true. The geography marks awarded by the same teacher
were derived from a new-type test paper, and so give a
roughly normal distribution. It is surely obvious that 5/10
for geography does not have the same significance, nor
represent the same achievement, as 5/10 for reading. In the
one it means approximately the average for the class, in
the other it means right at the bottom of the class. But
the children themselves and their parents are likely to
assume that 5/10 always means the same thing.
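
The difference in meaning of 5/10 in the two subjects can be verified from the frequencies of Ex. 5. The Python sketch below copies the two columns as they are printed; because the reading column as printed does not add up to the stated total, the function divides by each column's own sum, which is an assumption made here to keep the calculation self-consistent.

```python
def percent_below(frequencies, mark):
    """Percentage of pupils whose mark is lower than `mark`."""
    total = sum(frequencies.values())
    below = sum(f for m, f in frequencies.items() if m < mark)
    return 100 * below / total


# Frequencies from Ex. 5, keyed by mark out of 10.
reading = {10: 24, 9: 25, 8: 18, 7: 6, 6: 4, 5: 2,
           4: 0, 3: 0, 2: 0, 1: 0, 0: 0}
geography = {10: 2, 9: 3, 8: 8, 7: 10, 6: 11, 5: 9,
             4: 9, 3: 7, 2: 10, 1: 4, 0: 1}

print(f"reading:   {percent_below(reading, 5):.0f}% of pupils scored below 5")
print(f"geography: {percent_below(geography, 5):.0f}% of pupils scored below 5")
```

No pupil falls below 5 in reading, whereas roughly two-fifths of the class fall below 5 in geography, so the same mark marks the bottom of one distribution and the middle of the other.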

THE MEASUREMENT OF ABILITIES

FIG. 17.—Distributions of percentage marks from (a) an objective examination, and (b) a subjectively marked examination.

Fig. 17 compares the distributions of percentage marks
obtained by a large group of students on a new-type
examination and on an ordinary examination of the essay-
type. The former, though somewhat irregular, tends
towards the normal, symmetrical shape, and includes
marks ranging from 19% to 86%. The latter is skewed,
and it ignores the bottom half of the available scale of
marks, since it ranges only from 50% to 90%.

Education authorities would surely distrust a school 
doctor the weights on whose weighing machine varied, 
say, from $ lb. to 1} lb. each. Yet the units implied by 
the distributions which many of their teachers employ are 
probably quite as variable as this, unless some scheme of 
rationalization is applied (cf. Chap. IV). It is notorious 
also that some faculties in а university use quite different 
scales of marks; or are much more lenient than other 
faculties. 

It must be obvious that such marks are not true measurements.
60% does not represent 60/100 of anything whatsoever.
True, it lies a certain distance between 0 and 100,
but as nobody can define what is meant either by zero or 
perfect ability, that fact does not help much. Like the 
qualitative grades of Type C, they are merely labels whose 
precise size is fixed more or less vaguely by convention. 
Naturally the convention varies in different institutions. 
For example, the pass standard in degree examinations 
is called 60% at some universities, 40%, 20%, or even 
10% at others. 

Numerical grading, like qualitative grading, actually 
consists of ranking the scripts in relation to the marker's
subjective past experience. It differs merely in that
number labels are substituted for verbal ones, and that
most numerical scales contain a larger number of different
labels than do qualitative scales. This latter feature leads
Pie: When an essay or a single 
examination answer has to be marked on a percentage 


to 90%). Actually it has been proved by experiment that 
people cannot consistently discriminate more than about 
10 grades, and some would say only 5. 
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THE REFORMATION OF MARKING 


The would-be reformer of marks meets with many 
objections and criticisms. Most of them either rest upon 
ignorance of elementary statistical principles, or else ‘boil
down’ to the view that the systems in vogue at present
are firmly established, generally understood by all con- 
cerned, and work too well and smoothly to be upset. 
While such a reactionary attitude is entirely unwarranted, 
yet there are admittedly many occasions in school marking 
where the rationality of the scales employed matters little, 
or where an eccentric distribution may be an actual 
advantage. 

Some important examinations do not aim at measuring
the abilities of all the pupils, but purely at selecting a
certain proportion of them. For instance, the English
Special Place Examination attempts to pick out the best
10% or so of candidates for secondary school scholarships.
The Scottish Qualifying Examination has roughly the opposite
purpose, namely to eliminate the poorest pupils who
are unfit for secondary or advanced work. Mowat (1938)
points out that the most appropriate examinations or tests
in these two instances will be ones which will spread out,
or discriminate between, candidates whose marks are close
to the respective borderlines. The former then should
yield a positively skewed distribution and give plenty of
headroom, while the latter should yield a negatively
skewed distribution and give plenty of footroom. The
marks of the bottom three-quarters of candidates in the
former, and of the top three-quarters in the latter, scarcely
matter. Hence, it would be a waste of time to construct
an examination which would spread them all out in the
proper normal fashion.

Next we can distinguish examinations which are constantly
set in schools for such pedagogic purposes as ensuring
that all, or practically all, the pupils have successfully
accomplished a certain stage in their work, or that they
have learnt, say, some French grammar, chemical formulae,
or history notes, set to them for preparation. Other exercises
are needed, notably in mathematical subjects, for
practice in the operations which the pupils are in process
of acquiring. In these instances we may expect distributions
to show a strong negative skew. The tests will
naturally be designed so that the majority can get full or
nearly full marks. The weaklings will be somewhat more
spread out, but even the poorest may be able to attain half
marks. Doubtless it is because these abnormal distributions
have become habitual in classwork that most teachers
are so blind to the defects in their examination marking.
Such tests which are used as pedagogic aids should be kept
quite distinct from tests or examinations designed for purposes
of measuring abilities. When the object is to compare
different pupils or to find how able a pupil may be
in different subjects, then they should certainly be constructed
and marked according to the principles outlined
below. Further, when a standardized (published) test of
intelligence or attainment is to be given to a class or to a
group of students, it should always be chosen so as to yield
a proper frequency distribution (cf. pp. 210-11).

Rationalization of Marking.—Marking will probably
never become scientific so long as it is confused with value
judgments. Thus the first point to realize is that marking is
a process of differentiating between the various levels of ability
existent among the scripts which are being marked. Either
it is a process of arranging them in rank order or of assigning
them to one of a limited number of qualitative or
numerical grades. But it is not a process of deciding
whether all, or some, of the scripts are ‘good’ or ‘bad,’
whether the pupils ‘ought to have done better’ or ‘have
tried hard,’ whether they ‘pass’ or ‘fail.’ All such
decisions should be postponed until the actual marking is
completed.

Often it is desirable that the examination should be of
the new- or objective-type. The reasons for this advice
appear in a later chapter, but one of the main ones is that
marking is greatly simplified and rendered far fairer. In
such examinations difficulties arise, not in the scoring
(which is purely of the count type), but in the setting.
They should be of such a length, and contain questions of
sufficient ease and difficulty, to differentiate among both
the poorest and the best pupils, and to yield an approximately
normal distribution. Ideally the average pupil in
the class, or the average examinee, should get as near as
possible to half marks, the poorest pupil little better than
zero, and the best pupil little less than full marks. For if
no examinee scores less than, say, 40%, then it is obvious
that the easy questions at the beginning are a waste of
everybody’s time. And if no one scores more than, say,
70%, the most difficult questions at the end are also wasted.

Humanitarian Objections.—An immediate, and quite
pardonable, objection will be raised against this ideal,
namely, that if marks range from nearly 0% to nearly
100%, the poorer pupils will become greatly discouraged
when they habitually obtain marks of around 10% to 20%,
and their parents will become bitterly resentful. Some
teachers are so conscious of the stimulating effects of
success and the depressing effects of failure, that they
deliberately distort marks, giving the duller pupils who
have tried hard more than their work deserves, and the
brighter ones who have not done so well as they might less
than they deserve. It may be that this kindheartedness
of teachers towards the dullards has been partly responsible
for forcing up their distributions of marks into the absurd
shapes typified by Ex. 5 above. Perhaps equally potent
has been their fear of the criticisms of head teachers and
inspectors if they admitted their failure to teach the almost
ineducable pupils by marking their work on its true merits.
Worthy as such humanitarian objections may be, they do
not constitute an excuse for bad test construction or bad
marking. But the solution is not difficult. We shall see
in Chap. IV that the actual average, and range, or the scale
of marks, employed do not matter in the slightest, provided
that everyone agrees to use the same scale. Either,
then, the test may contain sufficient (useless) questions
which are easy enough to encourage the poor pupils; or
the test may be rationally constructed, as indicated above,
so as to yield a range of, say, 4 to 26 out of 80, or 10 to 3
out of 80; and when the marking is complete the marks
can readily be converted into some more acceptable and
conventional scale, such as 40% to 90%, by the methods
described in Chap. IV. A third alternative is to withhold
all numerical grades from pupils and parents, since they
are so liable to misinterpretation, and instead, once the
scientific marking is done and is recorded for administrative
purposes, to reconsider the scripts and attach any
verbal or letter labels to them that appear suitable for
pupil and parental consumption.

Grading of Complex Products.—Certain
types of work such as English composition and ...
usually have to be graded on the basis of total ...
impression. The same is true of many examination scripts
when the answers consist of a series of historical, scientific,
or other essays. Here all the answers to any one question
should invariably be marked before the examiner proceeds
to another question. Whenever there are about ... or
more such essays (or other educational products) to be
dealt with, resort should be made to the quality or product
scale method (cf. p. 179). The marker should first skim
through a fair number of the scripts until he has gained a
rough impression of their average level and of their extreme
No account whatever should be 


· Next he should look through 
them more carefully and 


average, set it aside, and call it 5, 
be taken which appear t 
bunch ; у 
Finally, others should be i 

levels intermediate betw: 
5 and 1 (or 2). These s 


hould be numbered (8), 7, 6, 4, 8, 
(2). A good deal of Te 


vision of first impressions may be 
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required before а final decision is reached as to these 
nine (or seven) samples. Once they are chosen the mark- 
ing is straightforward. The rest of the scripts are simply 
compared one by one with the samples and each is awarded 
the mark of the sample which it most closely resembles in 
merit. Should one or two turn up which are superior to 
No. 9 or inferior to No. 1, they may be given 10 or 0.

One obvious advantage of this plan is that it enables the
marker to be self-consistent and to maintain the same
standards throughout. Another is that it does not
demand impossibly fine discriminations. The number of
samples can be reduced to five, or extended to eleven, if
desired, or some scripts may be given a mark midway
between two samples, e.g. 6½. But usually seven or nine
grades are sufficient. Most important, however, is the
fact that the final distribution will be fairly smooth and
symmetrical if the number of scripts is large and the
samples have been well chosen. Even if their number is
small there should be a majority of 6's, 5's, and 4's, and a
minority of 9's, 8's, 2's, and 1's; and there should be
about as many above as below 5. The precise marks
given to the samples do not matter. They may, for example,
equally well be called 90%, 85%, 80% . . . 55%,
50%, so long as the marker resolutely avoids all temptations
to withhold 90% on the irrelevant grounds that the
best scripts in the bunch do not seem to him to be ‘up to
90% standard.’ It is, however, probably better to do the
marking of each question on a 0-10 scale. Such marks
for several questions in an examination can legitimately
be added. The final totals can then be converted into
any desired conventional scale, just as can the scores from
an objective test.

Other Types of Marking.—It is more difficult to give
definite recommendations as to the marking of work of
mixed type, where only part of the merit or demerit resides
in points which can be counted. In foreign language
examinations, history, science, etc., both correct or incorrect
facts, and general style, or originality and the like,
may need assessment. Sometimes these quantitative and
qualitative features can be graded separately and later
combined in due proportions. It might be better if, as is
suggested in Chap. XIV, such papers were not set; either
they should be definitely new-type, or definitely of the
qualitative essay-type. Nevertheless, the main requisites
of good marking should always be attainable, namely a
similar distribution and similar average level of marks in
all subjects, or in all the questions set in any one subject,
and an approximation of this distribution to normality.

Still more difficult are the problems of the marker who
has only a few scripts with which to deal, e.g. a university
external examiner who is called in to assess half a dozen
degree candidates. Even if the papers consisted of perfectly
constructed new-type tests, they would inevitably
yield extremely irregular and variable distributions. The
examiner is, therefore, forced to employ subjective recollections
of the merits of similar candidates whom he has
marked in the past. Nevertheless, he should keep the
normal curve in mind, for in the long run all the scripts
that he meets are likely to be normally distributed. He
should try, therefore, to build up definite mental specifications
of his 9, 8, 7 . . . 2, 1 levels, and attempt, by
thorough discussion with the internal examiners, to decide
where on this scale the First Class, Pass, or other standards
fall.

Some additional points bearing on these plans, and the
answers to some of the possible criticisms, will be discussed
when we have studied, in the next chapter, the statistical
conceptions of percentiles, averages, and standard deviations.


CHAPTER III 


STATISTICAL METHODS FOR HANDLING 
FREQUENCY DISTRIBUTIONS 


Numerical Description of Distributions.—So far we have
dealt with various types of distributions of measures almost
wholly in verbal terms, and have compared one distribution
with another mainly ‘by eye.’ More exact study of them
necessitates quantitative description and definition. We
need first a numerical expression for the ‘general run’ of
the measures, and, secondly, some index of the range or
spread of measures. For instance, if we want to know
how intelligent Johnny Smith is, and ask a psychologist,
he will tell us Johnny's score on an intelligence test and, in
order to enable us to interpret this score, he will add that
the normal score for children of Johnny’s age is so and so,
and that Johnny’s score falls at such and such a level in
the distribution. Again, if we ask a teacher what are the
heights of the children in her class, she might in reply show
us a complete table of the distribution of heights, or she
might tell us with greater precision that the average was
50 inches and the range from 42 to 57 inches.

Actually the range is not a good index of the way the
measures are spread out above and below the average, since
it depends on only two of the measures, those obtained by
the extreme pupils who were measured. For example,
although this class contains children ranging from 42 to
57 inches, all the rest might lie between 46 and 54 inches,
and we would hardly get a fair picture of the spread if we
were only told that the total range is 15 inches. A much
better index is called the Standard Deviation—often
abbreviated to S.D. or $\sigma$, which will be described below.

A distribution of measures can then generally be defined
numerically by means of its average and standard deviation.
But we shall describe first an alternative system
which possesses certain practical advantages. This is
based on what are known as the median, quartiles, and
deciles or percentiles. These terms all refer to certain
levels or fixed points in a distribution. Suppose the
measures to be tabulated or arranged in order, then the
median is the middle measure, i.e. the measure half-way
up counting from the bottom, or half-way down counting
from the top. The lower and upper quartiles, usually
referred to as $Q_1$ and $Q_3$, are the measures one-quarter and
three-quarters way up. The decile measures are one-tenth,
two-tenths, etc., and the percentiles one-hundredth,
two-hundredths, etc., of the way up. Stated in a different
way, the Xth percentile is the measure below which X%
of all the measures fall.


Determining the Median.—Ex. 6.—Consider the following
distribution of only five measures. Clearly the median is 15,
that is the third measure up from the bottom or down from
the top.

From this example it may be seen that the median is
the $\frac{N+1}{2}$th measure, not the $\frac{N}{2}$th.


Ex. 7.—In the following distributions, $N = 8$, hence
$\frac{N+1}{2} = 4\frac{1}{2}$.

Thus the median is half-way between the fourth and fifth
measures from the bottom. In the first distribution the fourth
and fifth measures are both 7, hence the median is 7. In the
second, the fourth and fifth measures are 6 and 7, hence we
have to take 6½ as the median. In the third distribution the
data are too sparse for an accurate median to be determined.
The best we can do is to interpolate and call it 10.5.

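The counting rule of Ex. 6 and Ex. 7 may be sketched in a few lines of Python. This is only an illustration: the function name and the two short lists of measures are invented for the purpose and are not taken from the examples in the text.

```python
def median(measures):
    """Median as the (N + 1) / 2-th measure, counting up from the bottom.

    When N is even the position ends in one-half, so the value half-way
    between the two middle measures is taken.
    """
    ordered = sorted(measures)
    n = len(ordered)
    lower = ordered[(n - 1) // 2]   # the middle measure, or the lower of two
    upper = ordered[n // 2]         # the middle measure, or the upper of two
    return (lower + upper) / 2


# Hypothetical data: five measures, then eight measures.
five = [9, 12, 15, 18, 22]
eight = [3, 5, 6, 6, 7, 8, 9, 11]

print(median(five))    # third measure from the bottom
print(median(eight))   # half-way between the fourth and fifth measures
```

With five measures the function returns the third from the bottom; with eight it returns the value midway between the fourth and fifth, as in the second distribution of Ex. 7.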

Determining the Quartiles and Percentiles.—The quartiles
are similarly obtained by counting the measures which are
$\frac{N+1}{4}$ and $\frac{3(N+1)}{4}$ from the bottom. For the
distribution given in Ex. 1 (p. 5), $N = 50$, hence the required
measures are 12¾ and 38¼ up the list. The 11th, 12th, and
13th measures are 9, the 37th to 43rd measures are all 17,
hence $Q_1 = 9$, $Q_3 = 17$.

Of the percentiles, the most useful are the 100th or 99th,
90th, 75th, 50th, 25th, 10th, 1st, or 0th. The 100th and
0th are, of course, the top and bottom measures in the
distribution, respectively. The 75th and 25th are the
quartiles, the 50th is the median. The 90th percentile or
9th decile is the measure which is $\frac{9(N+1)}{10}$ from the
bottom.
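
As a rough check on the counting method, the following Python sketch finds the quartiles and median of the Ex. 1 distribution, using the frequencies as they are tabulated in Ex. 9. The function names are illustrative assumptions, and the interpolation between neighbouring measures is one simple way of handling fractional positions, not necessarily the only one.

```python
def expand(freq):
    """Turn a {measure: frequency} table into a sorted list of measures."""
    return sorted(x for x, f in freq.items() for _ in range(f))


def measure_at(ordered, position):
    """Measure at a 1-based, possibly fractional, position from the bottom."""
    below = int(position)          # whole part of the position
    frac = position - below        # fractional part, e.g. 0.75
    low = ordered[below - 1]
    if frac == 0:
        return low
    high = ordered[below]          # next measure up the list
    return low + frac * (high - low)


# Frequencies of the scores 20 down to 1 (Ex. 1, as tabulated in Ex. 9).
EX1 = {20: 2, 19: 2, 18: 3, 17: 7, 16: 4, 15: 6, 14: 3, 13: 4, 12: 2, 11: 2,
       10: 2, 9: 3, 8: 2, 7: 1, 6: 2, 5: 1, 4: 2, 3: 0, 2: 1, 1: 1}

scores = expand(EX1)
n = len(scores)
q1 = measure_at(scores, (n + 1) / 4)         # position 12.75
mid = measure_at(scores, (n + 1) / 2)        # position 25.5
q3 = measure_at(scores, 3 * (n + 1) / 4)     # position 38.25
print(n, q1, mid, q3)
```

The positions 12¾ and 38¼ fall among measures of 9 and 17 respectively, so the sketch agrees with the quartiles quoted in the text.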

Determining Percentiles from Tabulated Data.—When
dealing with measures which have been grouped into
classes, these various levels are likely to fall somewhere in
the middle of a class. For instance, in Ex. 3 (p. 12), the
median of 250 measures is the 125½th, which lies somewhere
within the class 98-102. Its position is found by
interpolation, and the following formula may be used.

$$X_P = L + \frac{C}{F}\left(\frac{PN}{100} - S\right)$$

$P$ = the percentile.

$X_P$ = the measure falling at this percentile level, which
we wish to find.

$L$ = the lower limit of the class, somewhere within which
$X_P$ lies.

$S$ = the sum of all the frequencies up to, but not including,
this class.

$F$ = the frequency within this class.

$N$ = the total of all the frequencies.

$C$ = the size of the class.

Ex. 8.—To find the median of the distribution given in Ex. 3,
substitute in this formula:

$P = 50$. $L$, the lower limit, $= 98$. $S = 3 + 10 + 18 + 27 + 43 = 101$.
$F = 47$. $N = 250$. $C = 5$.

$$X_P = 98 + \frac{5}{47}\left(\frac{50 \times 250}{100} - 101\right) = 100.5 \text{ (to the first place of decimals).}$$

By applying the same formula, we find the 99th, 90th, 75th,
25th, 10th, and 1st percentiles to be 128, 117, 109, 94, 86, and
77, respectively (omitting decimals).
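
The interpolation formula can be tried out with a short Python sketch. The class frequencies are those of the distribution tabulated in Ex. 12; the function name and the representation of each class by its lower limit are assumptions made for this illustration.

```python
# (lower limit of class, frequency), lowest class first; class size C = 5.
CLASSES = [(73, 3), (78, 10), (83, 18), (88, 27), (93, 43), (98, 47),
           (103, 35), (108, 26), (113, 22), (118, 11), (123, 6),
           (128, 1), (133, 1)]


def grouped_percentile(p, classes, size):
    """X_P = L + (C / F) * (P * N / 100 - S) for grouped measures."""
    n = sum(freq for _, freq in classes)
    target = p * n / 100            # how many measures lie below X_P
    below = 0                       # S: frequencies below the current class
    for lower, freq in classes:
        if freq > 0 and below + freq >= target:
            return lower + size / freq * (target - below)
        below += freq
    raise ValueError("p must lie between 0 and 100")


median = grouped_percentile(50, CLASSES, 5)
others = [round(grouped_percentile(p, CLASSES, 5))
          for p in (99, 90, 75, 25, 10, 1)]
print(median, others)
```

The median comes out at about 100.55, and the other six levels, rounded to whole numbers, agree with the figures quoted in Ex. 8.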


Comparison of Median and Average.—If the distribution
is a perfectly smooth and normal one, the median is identical
with the average. In any fairly regular and symmetrical
distribution they will be quite close to one another.
Thus in Ex. 8, where the median is 100.5, the average
happens to be 100.68. For rough work the median is
often used as an approximation to the average, since its
computation is far easier. Sometimes, indeed, it gives a
better notion of the ‘general run’ of the measures than
does the average. For example, the average age of a
school class may be unduly raised because the class includes
two or three pupils much older than all the rest.
They cannot very well be omitted in calculating the
average, but if the median is found, they no longer exert
a disproportionate effect upon the result. Similarly, if a
free word association test is applied to a child at a psychological
clinic, most of the child's responses may be given
within two to three seconds, but some of them may take
much longer, and a few words may evoke no response at
all. In such a case the average cannot be determined,
and the median is the only possible measure of the child's
speed of response.

Here is another instance where the median may be very
useful. Suppose an investigator wishes to find the approximate
average speed with which a group of pupils or
students can perform a certain task, such as handwriting.
There is no need for him to time each person separately,
nor to count the number of words each one has written in
a certain time, and then add up and average the results.
Instead, he may set the whole class writing a standard
passage, and tell them to hold up their hands directly they
finish. Then he may count the hands raised, and stop the
test as soon as the median person raises his hand, noting
the time at which this occurs. The median speed of
writing for a class of, say, 41 persons can be calculated
from the time taken by the 21st quickest person.

When a distribution is strongly skewed, the median
differs appreciably from the average, being larger if the
skew is negative, smaller if it is positive. In Ex. 1 (p. 15),
the median is 14, the average 12.86. This difference is
often used as a measure of skewness (the formula can be
found in more advanced textbooks).

Uses of Percentiles.—The percentile levels in a distribution
clearly provide us with as complete a numerical
description of that distribution as we may need. For
example, the figures quoted above for the distribution of
students’ scores on the Cattell Intelligence Test (the 99th,
90th, 75th, 50th, 25th, 10th, and 1st percentiles) give a
reasonably full answer to the question—how did the
students do on the test? Knowing these levels, we can
begin to interpret the significance of any one person's score,
since we can tell how he stands relative to this group of
students. Thus a score of 92 on the Cattell test would
not be a very good one for a training college student. It
falls at about the 20th percentile, which means that 80%
of the students did better at the test, only 20% did worse.
But when we are told that the same score falls at the 94th
percentile for normal adults (not students), we realize that
it does represent quite high ability. The applications
of percentiles to the interpretation of school marks and to
test norms will be considered later.

Inter-quartile Range.—The range of scores from the
25th to the 75th percentiles ($Q_1$ to $Q_3$) is often called the
inter-quartile range. It tells us what the middle 50% of
the group accomplished. Half of this range is often
referred to as $Q$, the semi-interquartile range, and this is
used as a measure of the degree of spread or dispersion of
the distribution. When the distribution is smooth and
normal, $Q$ is usually equal to about one-eighth of the total
range of measures. In Ex. 3 (p. 12) the total range is 61,
and $Q = \frac{109 - 94}{2} = 7.5$, which is one-eighth of 60.
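
The arithmetic of the semi-interquartile range is slight, but it may be set down as a small Python sketch for completeness. The function name is an assumption; the quartile values 94 and 109 are those quoted in the text.

```python
def semi_interquartile_range(q1, q3):
    """Q: half the distance between the lower and upper quartiles."""
    return (q3 - q1) / 2


q = semi_interquartile_range(94, 109)
print(q)        # half of the inter-quartile range
print(q * 8)    # compare with the total range of about 60
```

Multiplying Q by eight gives a quick check on the rule of thumb relating Q to the total range in a smooth, normal distribution.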


CALCULATING THE AVERAGE OR MEAN 


More accurate statistical treatment of distributions
demands the use of the average and standard deviation in
place of the median and $Q$. Everyone knows how to
calculate an average or arithmetic mean—often referred
to as ‘the mean’—but most people use an unnecessarily
complex method, which is very liable to lead to mistakes.
We will therefore outline this and the shorter alternative
methods.

(1) The ordinary method consists in adding up all the
measures and dividing by the number of cases. Expressed
as an algebraic formula, $M = \frac{\Sigma X}{N}$: $M$ = the mean or
average, $X$ = any one measure, and $\Sigma$ = the sum of all
the . . . , hence $\Sigma X$ signifies the sum of all the separate
measures. As a rough general rule, this method may be
used when $N$ does not amount to more than about 20.
With larger numbers it becomes very inefficient, unless an
adding machine is available.

(2) Often it is quicker to tabulate the measures, to
multiply each $X$ by its $F$, and then to sum. The formula
now becomes $M = \frac{\Sigma(FX)}{N}$.


Ex. 9.—The method is here applied to the distribution
already given in Ex. 1.

    X      F     FX
   20      2     40
   19      2     38
   18      3     54
   17      7    119
   16      4     64
   15      6     90
   14      3     42
   13      4     52
   12      2     24
   11      2     22
   10      2     20
    9      3     27
    8      2     16
    7      1      7
    6      2     12
    5      1      5
    4      2      8
    3      0      0
    2      1      2
    1      1      1
       N = 50   643

$M = \frac{643}{50} = 12.86$
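
The tabular method of Ex. 9 translates directly into Python. The sketch below is illustrative only; the dictionary layout and the function name are assumptions, while the frequencies are those of the table above.

```python
# Frequency table of Ex. 9: score -> number of pupils obtaining it.
EX9 = {20: 2, 19: 2, 18: 3, 17: 7, 16: 4, 15: 6, 14: 3, 13: 4, 12: 2, 11: 2,
       10: 2, 9: 3, 8: 2, 7: 1, 6: 2, 5: 1, 4: 2, 3: 0, 2: 1, 1: 1}


def mean_from_frequencies(freq):
    """M = sum(F * X) / N, where N is the total of the frequencies."""
    n = sum(freq.values())
    total = sum(x * f for x, f in freq.items())
    return total / n


n = sum(EX9.values())
total = sum(x * f for x, f in EX9.items())
mean = mean_from_frequencies(EX9)
print(n, total, mean)
```

The totals printed are N and the sum of the FX column, from which the mean follows by a single division.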


(3) The process is shortened still further if the measures
are tabulated in classes. If $a$ is the midpoint of each
class, then $M = \frac{\Sigma(Fa)}{N}$.

Ex. 10.—The method is here applied to the distribution given
in Ex. 2.

     X        a      F     Fa
   18-20     19      7    133
   15-17     16     17    272
   12-14     13      9    117
    9-11     10      7     70
    6-8       7      5     35
    3-5       4      3     12
    0-2       1      2      2
                N = 50    641

$\Sigma(Fa) = 641$

$M = \frac{641}{50} = 12.82$


Note that this value of $M$, 12.82, is not precisely identical
with that obtained by the previous method, 12.86, though
the discrepancy is only 0.3%. The reason is, of course,
that the figures are somewhat distorted by being grouped
into classes. We are regarding the 20's and the 18's as
19's, the 17's and 15's as 16's, and so on. In the long run
these distortions tend to cancel one another out, and
therefore have very little effect upon the final result.
They only become serious if the grouping is too coarse,
i.e. if the number of classes in the table is too small.
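
The effect of grouping on the mean can be checked with a brief Python sketch. The class midpoints and frequencies are those of Ex. 10, and the ungrouped mean of 12.86 is taken from Ex. 9; the names used are assumptions for illustration.

```python
# (midpoint a, frequency F) for the classes 18-20, 15-17, ... 0-2 of Ex. 10.
EX10 = [(19, 7), (16, 17), (13, 9), (10, 7), (7, 5), (4, 3), (1, 2)]


def grouped_mean(classes):
    """M = sum(F * a) / N, treating every measure as its class midpoint."""
    n = sum(f for _, f in classes)
    return sum(a * f for a, f in classes) / n


grouped = grouped_mean(EX10)
ungrouped = 12.86                     # mean found from the full table
discrepancy = (ungrouped - grouped) / ungrouped * 100
print(grouped, round(discrepancy, 1))
```

The discrepancy printed is expressed as a percentage of the ungrouped mean, which is how the figure of 0.3% in the text appears to have been reached.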

(4) The amount of computation, when large numbers
are involved, may be much reduced by applying the device
of the arbitrary or guessed mean. The following illustration
may explain it. Suppose we wish to find the average
height of a class of children, we could first of all measure
each child's height from the ground to the top of his head,
add up the figures and divide by $N$. An alternative plan,
which would do equally well, would be to place each child
by a 3-foot table, and measure the height from this to the
top of his head. We would average the measures as
before, but would add 36 inches to our final result. Again,
we might take a still higher table, say 50 inches, and
measure from this up to the tops of the heads of the larger
children, down to the heads of the smaller ones. If we
added together the former, subtracted the latter, divided
by $N$, and finally added 50 inches, we should, of course, get
exactly the same result as before. But the figures with
which we should be dealing would be far smaller than in
the first instance. We would be averaging, e.g. + 6, + 1,
- 2, + 8, - 5, etc., instead of 56, 51, 48, 58, 45, etc.

The method consists, then, in making a rough guess at
the mean, listing the amounts by which each measure
deviates from (is greater or less than) this arbitrary mean,
averaging the deviations, and finally adding the average
to the arbitrary mean.

$M = \frac{\Sigma(FD)}{N} + A$. $A$ is the arbitrary or guessed mean,
and $D$ the difference between each $X$ and $A$.


Ex. 11.—

    X      D      F     FD
   20     + 7     2     14
   19     + 6     2     12
   18     + 5     3     15
   17     + 4     7     28
   16     + 3     4     12
   15     + 2     6     12
   14     + 1     3      3
   13       0     4    + 96
   12     - 1     2      2
   11     - 2     2      4
   10     - 3     2      6
    9     - 4     3     12
    8     - 5     2     10
    7     - 6     1      6
    6     - 7     2     14
    5     - 8     1      8
    4     - 9     2     18
    3     -10     0      0
    2     -11     1     11
    1     -12     1     12
                      - 103

$A = 13$. $M = \frac{96 - 103}{50} + 13 = -0.14 + 13 = 12.86$

The method is here applied to the distribution given
in Ex. 9 (p. 49), and it yields precisely the same result for
$M$ as was obtained by the ordinary method of averaging.

Note that in this example $\frac{\Sigma(FD)}{N}$ is a minus quantity, and is
therefore subtracted from $A$. If we had chosen $A = 12$,
$\frac{\Sigma(FD)}{N}$ would have been a plus quantity, namely + 0.86.
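
The device of the guessed mean may be sketched in Python as follows. The function name is an assumption, and the frequencies are again those of Ex. 9; the point of the sketch is simply that any guess leads to the same mean, only the sign and size of the correction changing.

```python
EX9 = {20: 2, 19: 2, 18: 3, 17: 7, 16: 4, 15: 6, 14: 3, 13: 4, 12: 2, 11: 2,
       10: 2, 9: 3, 8: 2, 7: 1, 6: 2, 5: 1, 4: 2, 3: 0, 2: 1, 1: 1}


def mean_by_guess(freq, guess):
    """Return (correction, mean), where correction = sum(F * D) / N
    and D = X - A is the deviation of each measure from the guess A."""
    n = sum(freq.values())
    sum_fd = sum(f * (x - guess) for x, f in freq.items())
    correction = sum_fd / n
    return correction, guess + correction


corr13, mean13 = mean_by_guess(EX9, 13)   # guess above the true mean
corr12, mean12 = mean_by_guess(EX9, 12)   # guess below the true mean
print(corr13, mean13)
print(corr12, mean12)
```

With A = 13 the correction is negative and with A = 12 it is positive, yet both routes arrive at the same value of M.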


(5) By far the most efficient method when $N$ is large,
say 100 or more, is to average, not the actual amounts, but
the numbers of classes, by which the tabulated measures
exceed or fall short of $A$. For instance, suppose we have
a table of the distribution of ages of all the inhabitants of
a town, grouped into classes of ten years, with midpoints
of the classes at 5, 15, 25, etc., years. Wishing to calculate
the average age we take 35 as our arbitrary mean.
Obviously there is no need to sum the numbers who are
10, 20, 30, . . . years more or less than this $A$. We might
just as well sum those who are 1, 2, 3, . . . decades more
or less. Then, when we have calculated the average deviation
from $A$ in decades, we must multiply it by 10 to
turn it back into years, before adding it to, or subtracting
it from, $A$.

The formula for this method is $M = A + c\,\frac{\Sigma(Fd)}{N}$,
where $d$ is the number of the class above or below that
class of which $A$ is the midpoint, and $c$ is the size of each
class.

Ex. 12.—The method is here applied to the distribution given
in Ex. 3 (p. 12).

      X         d      F      Fd
   133-137     +6      1       6
   128-132     +5      1       5
   123-127     +4      6      24
   118-122     +3     11      33
   113-117     +2     22      44
   108-112     +1     26      26
   103-107      0     35    +138
    98-102     -1     47      47
    93-97      -2     43      86
    88-92      -3     27      81
    83-87      -4     18      72
    78-82      -5     10      50
    73-77      -6      3      18
                     250    -354

The arbitrary mean is taken as the midpoint of the 103-107
class, i.e. as 105. Classes above and below this are numbered
consecutively.

$$\frac{\Sigma(Fd)}{N} = \frac{+138 - 354}{250} = -0.864$$

This is the average number of classes by which $M$ differs from
$A$. We must multiply by 5 to get the average deviation in scores,
since $c = 5$.

$$M = 105 - 5 \times 0.864 = 100.68$$

Once the student is accustomed to this method he will
find it far quicker and more accurate than any other. If
in this example he had used one of the first three methods,
he would have been involved in a sum totalling about
25,000. Points which require special caution are: (i)
making sure what measures are included in a class, and
what is the size of $c$; (ii) determining $A$ correctly; (iii)
multiplying $\frac{\Sigma(Fd)}{N}$ by $c$ before adding it to $A$; (iv) making
sure of the sign of this quantity before adding it to, or
subtracting it from, $A$.
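
The class-deviation method of Ex. 12 may be illustrated by the following Python sketch. Each class is represented here by its midpoint, which is an assumption of the sketch rather than the layout of the table, and the function name is likewise invented.

```python
# (midpoint of class, frequency) for the classes 73-77 up to 133-137.
EX12 = [(75, 3), (80, 10), (85, 18), (90, 27), (95, 43), (100, 47),
        (105, 35), (110, 26), (115, 22), (120, 11), (125, 6),
        (130, 1), (135, 1)]


def coded_mean(classes, guess, size):
    """M = A + c * sum(F * d) / N, with d counted in whole classes from A.

    The guess A must itself be the midpoint of one of the classes.
    """
    n = sum(f for _, f in classes)
    sum_fd = sum(f * ((mid - guess) // size) for mid, f in classes)
    return guess + size * sum_fd / n


mean_a105 = coded_mean(EX12, 105, 5)   # A as chosen in Ex. 12
mean_a100 = coded_mean(EX12, 100, 5)   # a different, equally valid guess
print(mean_a105, mean_a100)
```

Both guesses lead to the same mean, which bears out the remark that the choice of A affects only the size of the figures handled, not the result.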


THE MEAN VARIATION

The Mean Variation or Average Deviation (M.V. or
A.D.) is occasionally employed as an index of the extent
to which the measures are spread out on either side of the
mean. As its name indicates, it is the average of all the
differences between each measure and the mean (or sometimes
the median), regardless of whether these differences
are plus or minus.

Ex. 13.—In this very brief distribution, the mean is 5.6.
The $D$ column lists the deviations. M.V. $= \frac{\Sigma D}{N} = 2.08$. If
the data were in the form of a frequency distribution table, the
same method would apply, but the formula would be M.V. $= \frac{\Sigma(FD)}{N}$.

     X       D
     9      3.4
     7      1.4
     6      0.4
     4      1.6
     2      3.6
   N = 5   10.4
   M = 5.6


Clearly the M.V. will be larger the more widely dispersed 
or spread out the measures. Its value is a little greater 
than that of Q, being generally about one-seventh of the 
total range of measures. (This does not hold in Ex. 13
owing to the shortness and irregularity of the distribution.) 
The M.V. happens to be of very little use for purposes of 
further statistical treatment, hence it is usually passed 
over in favour of the Standard Deviation. 
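
The working of the Mean Variation for the short distribution of five measures can be sketched in Python thus. The function name is an assumption; the measures 9, 7, 6, 4, 2 are those of the example.

```python
def mean_variation(measures):
    """Average of the deviations from the mean, all taken as positive."""
    n = len(measures)
    mean = sum(measures) / n
    return sum(abs(x - mean) for x in measures) / n


data = [9, 7, 6, 4, 2]
mv = mean_variation(data)
print(sum(data) / len(data), round(mv, 2))
```

The sketch prints the mean and the M.V. of the five measures, which may be compared with the figures of the worked example.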


THE STANDARD DEVIATION

The S.D. is also a kind of average of all the deviations
from the mean, but it differs from the M.V. in that the
deviations are squared before they are summed, and the
square root of their average is taken.

Ex. 14.—The working is here shown for the same short distribution
as in Ex. 13.

     X       D       D²
     9      3.4    11.56
     7      1.4     1.96
     6      0.4     0.16
     4      1.6     2.56
     2      3.6    12.96
   N = 5           29.20
   M = 5.6

$\frac{\Sigma D^2}{N} = \frac{29.20}{5} = 5.84$

$\sigma = 2.42$
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As the mean is seldom a whole number, the squaring of 
deviations from it is troublesome. Generally it is simpler 
to list the deviations from some arbitrary mean which is а 
whole number, and then to apply a correction according 
to the following formula : 


XD: 
N 


Note that the correction is invariably subtracted, whether 
its Sign before squaring is positive or negative. 


= — (M — А) 


Ex. 15.—In the same distribution, let A = 6, i.e. the nearest whole number to M.

    X      D      D²
    9      3      9
    7      1      1
    6      0      0
    4      2      4
    2      4     16
                 30

$$\frac{\Sigma D^2}{N} = \frac{30}{5} = 6.0, \qquad (M - A)^2 = (5.6 - 6)^2 = 0.16$$

$$\sigma^2 = 6.0 - 0.16 = 5.84, \qquad \sigma = 2.42$$
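
That the arbitrary-mean correction leaves the result unchanged can be verified numerically. The sketch below is an illustration only (the function names `sd_direct` and `sd_from_origin` are hypothetical); it computes the S.D. of the same five measures directly, and then from deviations about an arbitrary mean of 6 and of zero.

```python
import math


def sd_direct(measures):
    """S.D. from deviations about the true mean."""
    n = len(measures)
    mean = sum(measures) / n
    return math.sqrt(sum((x - mean) ** 2 for x in measures) / n)


def sd_from_origin(measures, origin):
    """S.D. from deviations about an arbitrary mean, with the correction
    (M - A) squared subtracted from the mean squared deviation."""
    n = len(measures)
    mean = sum(measures) / n
    mean_sq_dev = sum((x - origin) ** 2 for x in measures) / n
    return math.sqrt(mean_sq_dev - (mean - origin) ** 2)


scores = [9, 7, 6, 4, 2]
print(round(sd_direct(scores), 2))           # 2.42
print(round(sd_from_origin(scores, 6), 2))   # 2.42
print(round(sd_from_origin(scores, 0), 2))   # 2.42
```

Whatever whole number is chosen for the arbitrary mean, the corrected result agrees with the direct one; the choice affects only the labour of squaring.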


The same formula may be extended to several different types of distributions.

(a) When the distribution is a short one, and all the measures are whole numbers, A may be chosen at zero. The Ds will then be the original X's, so that:

$$\sigma^2 = \frac{\Sigma X^2}{N} - M^2$$

Ex. 16.

    X      X²
    9      81
    7      49
    6      36
    4      16
    2       4
          186

$$\frac{\Sigma X^2}{N} = \frac{186}{5} = 37.2, \qquad M^2 = 31.36$$

$$\sigma^2 = 37.2 - 31.36 = 5.84, \qquad \sigma = 2.42$$



(b) Often the mean is not known, but is calculated at the same time as the S.D. Now $M = \frac{\Sigma D}{N} + A$. Hence an alternative version of the same formula is:

$$\sigma^2 = \frac{\Sigma D^2}{N} - \left(\frac{\Sigma D}{N}\right)^2$$

(c) When the measures are grouped into classes, the formula becomes:

$$\sigma = c\sqrt{\frac{\Sigma(Fd^2)}{N} - \left(\frac{\Sigma(Fd)}{N}\right)^2}$$

Here c is the size of each class, and d the deviation of each measure from an arbitrary or guessed mean in terms of number of classes.


Ex. 17.—The formula is here applied to the distribution whose mean was calculated in Ex. 12. It was found there that $\frac{\Sigma(Fd)}{N} = -0.864$. Hence the correction $\left(\frac{\Sigma(Fd)}{N}\right)^2 = 0.746$.

    X        F      d      d²     Fd²
  133 +      1     +6     36      36
  128 +      1     +5     25      25
  123 +      6     +4     16      96
  118 +     11     +3      9      99
  113 +     22     +2      4      88
  108 +     26     +1      1      26
  103 +     35      0      0       0
   98 +     47     −1      1      47
   93 +     43     −2      4     172
   88 +     27     −3      9     243
   83 +     18     −4     16     288
   78 +     10     −5     25     250
   73 +      3     −6     36     108
           250                  1,478

$$\frac{\Sigma(Fd^2)}{N} = \frac{1478}{250} = 5.912$$

$$\sigma = 5\sqrt{5.912 - 0.746} = 11.36$$


One more instance will be given, to illustrate the calculation of the mean and S.D. in one operation.

Ex. 18.—The following distribution of school marks is arranged in classes of two. Take A at the 11-12 class.

    X        F      d      Fd     Fd²
  19-20      2     +4       8     32
  17-18      5     +3      15     45
  15-16      4     +2       8     16
  13-14      4     +1       4      4
  11-12      3      0     +35      0
   9-10      7     −1       7      7
   7-8       4     −2       8     16
   5-6       4     −3      12     36
   3-4       2     −4       8     32
   1-2       1     −5       5     25
  N = 36                  −40    213

$$\frac{\Sigma(Fd)}{N} = \frac{-5}{36} = -0.139$$

This is in terms of classes. In terms of marks,

$$\frac{\Sigma(FD)}{N} = -0.278$$

Thus M = 11.5 − 0.278 = 11.22.

$$\frac{\Sigma(Fd^2)}{N} = \frac{213}{36} = 5.917, \qquad \left(\frac{\Sigma(Fd)}{N}\right)^2 = 0.019$$

$$\sigma = 2\sqrt{5.917 - 0.019} = 4.858$$

But these calculations are quite quickly and easily accomplished with a little practice, though some care is needed in checking the arithmetic. For example, the whole of the above calculation, using 4-figure logarithms and checking by slide-rule, occupied the writer less than four minutes.

The S.D. is always greater than Q or the M.V. This is to be expected, since any big deviations from the mean have much greater weight in the S.D. than in the M.V., being squared before they are averaged. It generally amounts to one-fifth or one-sixth of the total range, or sometimes more if the distribution is irregular, as in Ex. 18. This gives a check to our result. If in Ex. 18 we had obtained σ = 2 or σ = 10, we should have known that there had been some mistake in the computation.
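
The grouped-data working lends itself to a short routine. What follows is a hedged sketch rather than a prescribed method: the function `grouped_mean_sd` and its parameter names are illustrative, the arbitrary mean is taken at the midpoint of the chosen class, and the frequencies are those of the two grouped distributions worked above (N = 36 and N = 250).

```python
import math


def grouped_mean_sd(frequencies, steps, class_size, origin):
    """Mean and S.D. of grouped data from class-step deviations.

    steps[i] is the deviation d of class i from the arbitrary mean,
    counted in whole classes; origin is that arbitrary mean in marks.
    """
    n = sum(frequencies)
    mean_step = sum(f * d for f, d in zip(frequencies, steps)) / n
    mean_sq_step = sum(f * d * d for f, d in zip(frequencies, steps)) / n
    mean = origin + class_size * mean_step
    sd = class_size * math.sqrt(mean_sq_step - mean_step ** 2)
    return mean, sd


# School marks in classes of two, 19-20 down to 1-2; origin is the
# midpoint (11.5) of the 11-12 class.
marks_f = [2, 5, 4, 4, 3, 7, 4, 4, 2, 1]
marks_d = [4, 3, 2, 1, 0, -1, -2, -3, -4, -5]
mean, sd = grouped_mean_sd(marks_f, marks_d, class_size=2, origin=11.5)
print(round(mean, 2), round(sd, 2))       # 11.22 4.86

# Test scores in classes of five, 133+ down to 73+; origin is the
# midpoint (105) of the 103-107 class.
test_f = [1, 1, 6, 11, 22, 26, 35, 47, 43, 27, 18, 10, 3]
test_d = [6, 5, 4, 3, 2, 1, 0, -1, -2, -3, -4, -5, -6]
mean17, sd17 = grouped_mean_sd(test_f, test_d, class_size=5, origin=105)
print(round(mean17, 2), round(sd17, 2))   # 100.68 11.36
```

The small difference between 4.86 and the 4.858 obtained by hand reflects the rounding inherent in 4-figure logarithms, not a different method.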

The Significance of the S.D.—Before we go further with 
this description of statistical methods, it would be well to 
adopt a more uniform notation or set of symbols. For 
reasons which will appear later, it is often preferable to 
deal not with original measures (which are denoted by the 
symbol X), but with measures expressed as deviations 
from their own mean. The latter will be denoted by x. Thus when X = 80 points of I.Q., and the mean I.Q. = 100, x = −20. Again, actual frequencies (which are denoted by F) are often replaced by proportionate frequencies; that is, the actual frequencies divided by N. And since frequencies are always plotted graphically along the Y axis, they will be denoted by y. For example, in Exs. 12 and 17, 26 students obtained test scores of 108 to 112. The mean was 100.68, and N was 250. The x and y values for such students are therefore:

$$x = 110 - 100.68 = +9.32, \qquad y = \frac{26}{250} = 0.104$$

Now, it is well known that many graphs, such as the straight line and the parabola, can be represented by certain algebraic equations. For instance, y = kx gives a straight line, running through the origin, whose slope depends on the size of k. In the same way a normal distribution curve is represented by an equation¹ which involves x, y, σ, and certain constants. This is a most important fact, since it signifies that any normal distribution can be completely defined or numerically described if we know the mean and S.D. alone. It will be remembered that the alternative, percentile, system of defining distributions required the citation of some half-dozen percentile levels in order to provide a reasonably complete numerical description. This system, based on the mean and S.D., is therefore much more efficient. But it is restricted to normal, or practically normal, distributions, whereas the percentile system will apply to any shape of distribution.

¹ $y = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-x^2/2\sigma^2}$. No attempt will be made to deal here with the theory of probability from which this equation is derived, nor with certain other (less common) types of distribution curves for which other equations are available. Cf. Dawson (1933), Kelley (1921).

If, then, we know the S.D. we can deduce y for every value of x, that is the frequency of every possible measure. For instance, although we cannot measure the intelligence quotients of the whole population of Great Britain, yet, if we may assume that the I.Q. is a variable which is normally distributed, with a mean of 100 and a S.D. of 16½, we can state at once just what proportions of the population have I.Q.s of 100, 101, . . . etc., or what proportion lies below 97 I.Q., or between 120 and 130 I.Q., and so on.
Furthermore, there are available in most statistical textbooks tables which will tell us the value of y corresponding to each value of x/σ, regardless of what the variable may be (I.Q., height, handwriting speed, etc.). These are known as 'probability integral' tables. The graph given in Fig. 18 is based on such tables, but is probably easier to use, and is accurate to two or three significant places. Against x/σ is plotted, not y (the actual proportion of measures which deviate from the mean by this amount), but the proportion which deviate by this amount or more. Taking the lowest curve, it can be seen that nearly 0.16 or 16% of measures in any normal distribution differ from the mean of the distribution by 1 × σ or more. For example, if the S.D. of I.Q.s is 16½, then about 16% of the population have I.Q.s of 116½ or more, and, equally, about 16% of the population have I.Q.s of 83½ or less. The upper curves are for dealing more accurately with larger values of x/σ and smaller values of y. Each curve is, as it were, a 10 times magnification of a section of the curve below. For instance, we can read off from the first curve that about 0.023 or 2.3% of the population differ from the mean by 2σ or more, but the second curve allows a more accurate reading, namely 0.0228.
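
Readers without the graph to hand may obtain the same readings by calculation. The sketch below is an illustration only: it assumes that the quantity plotted is the one-sided tail proportion of the normal curve, which is how the figures quoted in the text behave, and the function name `proportion_at_or_beyond` is hypothetical.

```python
import math


def proportion_at_or_beyond(z):
    """Proportion of a normal distribution lying z S.D.s or more above the mean."""
    return 0.5 * math.erfc(z / math.sqrt(2))


for z in (1, 2, 3):
    print(z, round(proportion_at_or_beyond(z), 4))
# 1 0.1587
# 2 0.0228
# 3 0.0013
```

By symmetry the same proportions lie the corresponding distances below the mean.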


Ex. 18.—In a school population of 25,000, how many children may be expected to have I.Q.s of 120 to 130? And what will be the highest I.Q. among the least intelligent 1,000 of these children?

Intelligence is something which varies continuously; that is to say, every possible value of I.Q. in between 119 and 120 can occur, although we usually calculate it only to the nearest whole number. Strictly speaking, then, we want to know how many children have I.Q.s between 119.50 and 130.50, and to exclude those whose I.Q.s are 119.49 or less (who would count as 119), also those whose I.Q.s are 130.51 or more (who would count as 131). Now 119.5 and 130.5 differ from the mean by 19.5 and 30.5, that is by 1.18σ and 1.85σ, taking σ as 16½. Reading off from Fig. 18, the corresponding proportions are 0.119 and 0.0322. Hence the proportion of the population within the given limits is 0.119 − 0.0322 = 0.0868 = 8.68%. 8.68% of 25,000 is 2,170, the answer to our first question.

In the second question, 1,000 out of 25,000 is a proportion of 0.040. From Fig. 18, 0.04 corresponds to an x value of 1.75σ. 1.75 × 16½ = 28.875. The lowest thousand therefore have I.Q.s of 100 − 28.875, or less. Hence the answer, to the nearest whole number of I.Q., is 71.
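
The two questions of this example can also be answered without the graph. The following sketch uses Python's standard `statistics.NormalDist`; it assumes, as the example does, a mean of 100 and an S.D. of 16½, and the variable names are illustrative. Because x/σ is not rounded to two places, the first answer comes out near 2,160 rather than exactly the 2,170 read from the graph.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16.5)
population = 25_000

# Children counted as having I.Q.s of 120 to 130 lie between 119.5 and 130.5.
share = iq.cdf(130.5) - iq.cdf(119.5)
print(round(share * population))

# Highest I.Q. among the least intelligent 1,000 of the 25,000.
cutoff = iq.inv_cdf(1_000 / population)
print(round(cutoff))    # 71
```

The second answer agrees with the graphical one, since the rounding of x/σ to 1.75 makes no difference at the nearest whole number of I.Q.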

Ex. 19.—Does the distribution of intelligence test scores given in Ex. 17 (p. 56) conform to a normal distribution? In other words, are the obtained frequencies similar to the frequencies which would be expected for a truly normal distribution with the same mean (100.68) and the same S.D. (11.36)?

These test scores, unlike I.Q.s, are discrete. Nobody could score 107½, however carefully his score was computed. If he nearly, but not quite, completed 108 items, his score would still be 107. Thus in order to find what proportion may be expected (according to the normal curve) to fall within the 103-107 class, we must determine the proportion with scores of 103 or more and subtract from it the proportion with scores of 108 or more, not the proportions lying between 102.5 and 107.5. The calculations for all classes are given in the following table. For example, 103 is a deviation of +2.32 from the mean, or +0.20σ; 108 deviates +7.32 or +0.65σ from the mean. From Fig. 18 we find the proportions lying at or beyond these two limits to be 0.421 and 0.258 respectively. Hence the proportion to be expected within the 103-107 class is 0.421 − 0.258 = 0.163. N is 250, and 0.163 × 250 = 41. The penultimate column lists the expected frequencies (omitting decimals) and the last column gives the frequencies actually obtained. The two columns agree fairly closely, but there are some discrepancies. At present we lack a method for telling whether these discrepancies should be regarded as appreciable or important, but such a method will be described later (pp. 102-3). The full answer to this question will therefore be postponed.

                              Proportions      Proportions    F expected   F actually
    X         x       x/σ     lying at or      expected in    in each      obtained
                              beyond these     each class     class
                              values of x/σ
  133 +    +32.32    +2.85      0.00219          0.0022           1            1
  128 +    +27.32    +2.41      0.0078           0.0056           1            1
  123 +    +22.32    +1.97      0.025            0.0172           4            6
  118 +    +17.32    +1.53      0.062            0.037            9           11
  113 +    +12.32    +1.09      0.138            0.076           19           22
  108 +     +7.32    +0.65      0.258            0.120           30           26
  103 +     +2.32    +0.20      0.421            0.163           41           35
   98 +     −2.68    −0.24      0.404            0.175           44           47
   93 +     −7.68    −0.68      0.246            0.158           40           43
   88 +    −12.68    −1.12      0.133            0.113           28           27
   83 +    −17.68    −1.56      0.059            0.074           18           18
   78 +    −22.68    −2.00      0.023            0.036            9           10
   73 +    −27.68    −2.44      0.0073           0.0157           4            3
   68 +    −32.68    −2.88      0.00195          0.0073           2            0

                                                 1.00000        250          250

[Fig. 18. The 'probability integral' graph: proportion of measures deviating from the mean by x/σ or more.]
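
The expected frequencies of such a table can be reproduced approximately by calculation. The sketch below is an illustration under stated assumptions: it takes the fitted mean (100.68) and S.D. (11.36), treats each class as running from its lower limit to the lower limit of the next class, as the example prescribes for discrete scores, and uses exact normal-curve values rather than graph readings, so individual entries may differ by about a unit from those printed. The names are illustrative.

```python
from statistics import NormalDist

fitted = NormalDist(mu=100.68, sigma=11.36)
n = 250
lower_limits = list(range(73, 138, 5))    # 73, 78, ..., 133
observed = [3, 10, 18, 27, 43, 47, 35, 26, 22, 11, 6, 1, 1]


def expected_frequency(low, high=None):
    """Expected number of scores from `low` up to (not including) `high`.

    With high=None the class is open-ended at the top.
    """
    upper = 1.0 if high is None else fitted.cdf(high)
    return n * (upper - fitted.cdf(low))


expected = [expected_frequency(low, low + 5) for low in lower_limits[:-1]]
expected.append(expected_frequency(lower_limits[-1]))   # open top class, 133+
below = n * fitted.cdf(lower_limits[0])                 # all scores under 73

for low, e, o in zip(lower_limits, expected, observed):
    print(f"{low}+  expected {e:5.1f}  obtained {o}")
print(f"below {lower_limits[0]}  expected {below:5.1f}  obtained 0")
```

Whether the remaining discrepancies between the two columns matter is a separate question, which the text defers.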


On examining Fig. 18, we see that the proportions whose measures deviate by 2.5σ or more are very small, less than 1 in 100, and that those who deviate by 3σ or more only number about 1 in 1,000. Here, then, we have the reason for a statement made earlier, namely that the S.D. usually amounts to one-fifth or one-sixth of the total range of a distribution. For when N is about 100, and the distribution is a regular one, we shall ordinarily find the measures ranging from +2.5σ to −2.5σ, i.e. a total of 5σ. And when N approaches 1,000 we may find scores running from +3σ to −3σ, a total of 6σ.

There is yet another way of interpreting normal distributions, which will be useful to us later, that is in terms of probabilities. Since 0.159 of persons may be expected to obtain measures 1σ or more above the mean, the probability that any one person will measure this much or more is 0.159. The probability that he will measure less is 0.841. In other words, the odds are 841 to 159, or about 5.3 to 1 against his measure being as high as or higher than this. Similarly the probability of a measure as high as +3σ or more is 0.00135, and the chances against it are 740 to 1. This means, for example, that we should probably find only one child in an ordinary elementary school of about 740 pupils with an I.Q. of 150 or more, if we assume the I.Q.s of school children to be normally distributed with a σ of 16½, since 150 is just over 100 + 3 × 16½. Actually the σ of I.Q.s among elementary school children is usually much lower, which makes the odds against high I.Q.s much greater. If σ is 12½, then 150 is 4σ above the mean, and we can deduce from Fig. 18 that only about 1 in 31,000 reaches this level.
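
The conversion from a tail proportion to odds is a matter of simple division, as the following sketch shows. It is offered only as an illustration; the function name `odds_against` is hypothetical and the standard normal curve is assumed.

```python
from statistics import NormalDist


def odds_against(z):
    """Odds against a measure lying z S.D.s or more above the mean."""
    p = 1 - NormalDist().cdf(z)
    return (1 - p) / p


print(round(odds_against(1), 1))   # 5.3
print(round(odds_against(3)))      # 740
```

At 4σ the same function returns odds of somewhat over 31,000 to 1, in keeping with the order of magnitude read from the graph.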


CHAPTER IV


INTERPRETATION OF SCHOOL MARKS AND TEST SCORES


Combining or Comparing Distributions of Marks.—In an earlier chapter we found that the marks awarded to school work or examinations are often so erratically distributed that it is impossible to interpret them, as they stand, correctly. The first stages in the interpretation of marks involve combining and comparing them. For instance, the marks for different questions in one examination, or for different examinations in one subject, need to be combined to give a total score. Often the marks on several subjects are combined to yield a general view of the pupil's or student's all-round ability. Comparison enters when we wish to know whether a pupil or student is better on one subject than on some other subject (subjects which may have been marked either by the same, or by different, examiners or teachers). In large-scale examinations where several examiners each deal with a proportion of the scripts it is, of course, essential that their markings should be comparable. Sometimes, also, age allowances must be made; the marks of younger and older pupils need to be rendered comparable by the elimination of differences due to age.¹

Now, none of these processes can be carried out directly unless the marks to be combined or compared occur in similar frequency distributions. This is a fundamental statistical principle, very elementary in nature, yet very widely neglected. But with the aid of the statistical tools described in Chap. III, we are now in a position to determine whether such distributions are sufficiently similar, and if they are not, to adjust them in such a way as to make them so. These tools will further assist in interpretation, since they provide a scientific definition of those exceedingly vague conceptions, the goodness or badness of a mark or test score, and the ease or difficulty of an examination question or test item.

¹ Davies and Jones (1936) discuss this point and describe a simple method of making age allowances.

Distributions may be dissimilar from one another in three respects: (1) differences in their averages arise, for example, when one examiner is far more lenient than another; (2) differences in their dispersions may arise when one examiner spreads out his marks far more widely above and below the average than another; (3) differences occur in their shapes, e.g. normal, skew, rectangular, etc. For the
time being we will consider only the most important shape, 
the approximately normal. We may then state that marks 
can be directly combined or compared only if drawn from 
distributions with the same mean and the same S.D. 

The Importance of Equivalent Standard Deviations.—
Identity of S.D.s is the more important of these two require- 
ments, though it is less often recognized. In many 
instances of combination of marks, the averages do not 
matter in the slightest. 


Ex. 20.—Suppose that the final marks of a group of pupils or students depend on their results in three subjects, A, B, and C, and that the distributions are as follows:

  Subject    Mean    Range     S.D.
  A           60     40-80      7
  B           50     30-70      7
  C           80     60-100     7

Now, a pupil who is average on all three subjects obtains a total of 190/300, and another who is average on two subjects but top on the third scores 210/300 whatever the subject.


Ex. 21.—In contrast to Ex. 20, let us now combine the following sets of marks, whose S.D.s differ.

  Subject    Mean    Range     S.D.
  A           60     40-80      7
  B           50     10-90     14
  C           80     70-90      3½

A pupil who is average all round again scores 190. But now a pupil who is average in two subjects and top in a third obtains 210, 230 or 200 according as the third subject is A, B, or C, respectively. The fact of doing well in subject B (which has the lowest average) actually raises the score the most because B has the largest S.D. And doing well in C has the least effect in raising the final total (although it has the highest average), because it has the smallest S.D. Similarly if we compare the performances of three pupils who are at the lower quartile on one subject but average on the other two, we find their respective totals to be 185, 181, and 188. Doing badly on B has a big effect, but doing badly on C has little effect.
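
The totals quoted in this example follow directly from the means, top marks, and S.D.s. The sketch below is illustrative only: it assumes S.D.s of 7, 14 and 3½ for the three subjects, and, for the lower-quartile case, that each set of marks is normally distributed, so that the quartile lies about 0.674 S.D. below the mean. The dictionary names are hypothetical.

```python
from statistics import NormalDist

means = {"A": 60, "B": 50, "C": 80}
tops = {"A": 80, "B": 90, "C": 90}
sds = {"A": 7.0, "B": 14.0, "C": 3.5}

average_total = sum(means.values())       # 190 for a pupil average all round
q1_z = NormalDist().inv_cdf(0.25)         # lower quartile, about -0.674 S.D.

totals = {}
for subject in means:
    # Average on the other two subjects, top on this one.
    top_total = average_total - means[subject] + tops[subject]
    # Average on the other two subjects, lower quartile on this one.
    quartile_total = round(average_total + q1_z * sds[subject])
    totals[subject] = (top_total, quartile_total)
    print(subject, top_total, quartile_total)
# A 210 185
# B 230 181
# C 200 188
```

In both cases the size of the change in the total tracks the S.D. of the subject concerned, not its mean.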


The general principle, illustrated by these examples, is that when several sets of marks are combined, their relative weights, i.e. their respective degrees of influence upon the total, are proportional not to their averages but to their S.D.s. If in a final mark we wish to give more weight to one subject than to another, e.g. twice as much weight to written as to mental arithmetic, we shall accomplish it, not by making the average mark on the former twice the average on the latter, but by making the range and S.D. of the former twice as large. The usual procedure in such weighting is to multiply the former set of marks by two. This may produce the desired effect, not because it doubles the average, but because it doubles the S.D. The same result would follow if the average remained unchanged while the deviation of each pupil's mark from this average was doubled. When the marks derived from several questions in a single examination are to be combined in certain proportions, this same technique of adjusting the S.D.s rather than the averages for the questions should be applied.
If we wish to combine or compare two sets of marks whose S.D.s are dissimilar, we should then express each set in the form of deviations from their own means, and multiply one set by a figure which will raise or lower its S.D. to the same level as that of the other set. For instance, in the two sets B and C in Ex. 21, the ratio of S.D.s is 14 to 3½, or 4 to 1. The deviations in C will therefore become equivalent to those in B if multiplied by 4. Alternatively, each mark may be expressed in x/σ form. The top marks on A, B, and C thus become $\frac{80 - 60}{7}$, $\frac{90 - 50}{14}$, and $\frac{90 - 80}{3\frac{1}{2}}$. All of these equal +2.86. When so converted, they are all equivalent.

The Importance of Equivalent Means.—Naturally the averages of distributions are important under some circumstances. If marks on several questions or several subjects are to be combined, and alternatives are allowed, so that each pupil's final mark is derived from a selected number of questions or subjects, then it is essential to ensure equivalent means. Thus in Exs. 20 and 21, if pupils took either A and B, or A and C, those who were at the average level would total 110 on the former combination, 140 on the latter. Again, when scores are to be, not combined, but compared, it is obvious that differences in means will entirely upset the comparisons. The adjustment of discrepancies in means is easy: merely add or subtract so much from all the marks in one subject or question until its average is the same as that for the other subject or question. More often than not, however, discrepancies in means are accompanied by discrepancies in S.D.s. In that case resort must be made to the technique already described, according to which each mark is expressed as x/σ, or to some modification of this technique (cf. pp. 71-8).
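
Both adjustments, of S.D. and of mean, amount to one short calculation. The sketch below is a hedged illustration: the function names `standard_score` and `rescale` are hypothetical, and the figures are the means, top marks and S.D.s (7, 14 and 3½) of subjects A, B and C discussed above.

```python
def standard_score(mark, mean, sd):
    """Express a mark as a deviation from its own mean in S.D. units (x/sigma)."""
    return (mark - mean) / sd


def rescale(mark, mean_from, sd_from, mean_to, sd_to):
    """Re-express a mark on another distribution's mean and S.D."""
    return mean_to + standard_score(mark, mean_from, sd_from) * sd_to


# (top mark, mean, S.D.) for subjects A, B and C.
top_marks = {"A": (80, 60, 7.0), "B": (90, 50, 14.0), "C": (90, 80, 3.5)}
for subject, (mark, mean, sd) in top_marks.items():
    print(subject, round(standard_score(mark, mean, sd), 2))   # 2.86 each time

# The top mark in C, expressed on the scale of subject B.
print(rescale(90, 80, 3.5, 50, 14.0))   # 90.0
```

Multiplying the deviations in C by 4, as the text describes, is the special case of `rescale` in which only the S.D.s are matched.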


EQUATING OF MARKS BY THE PERCENTILE TECHNIQUE


Although the statistical treatment of distributions in terms of means and S.D.s is far from difficult, it may admittedly lie outside the scope of the ordinary teacher or examiner. Further, it cannot be used when the distributions bear no resemblance to normality. For instance, if a headmaster, to whom were submitted the marks listed in Ex. 5 (p. 34), wished to compare the pupils' standings on reading and geography, he could not use this technique. The alternative, percentile, system would, however, be quite suitable for making such a comparison, and it is probably simpler for the uninitiated to comprehend. (It does not even involve looking up squares and square roots, as does a technique based on the S.D.)

Percentiles will provide a uniform scale in terms of which comparisons may be made between any sets of marks. Whatever the test or examination which a group of pupils or students takes, a certain percentile level always means the same level of achievement (relative to that group of people). For instance, on looking at Ex. 5 we might be inclined to suppose that 8 marks for reading represents a better achievement than 3 marks for geography; certainly the pupils themselves would infer that this was so. But actually they are precisely equivalent, since both marks lie at the 25th percentile, Q₁. The teacher may, of course, protest that the general standard of reading in the class was good and the geography poor. This is an objection which will doubtless be raised by critics of this chapter, but we will show later why we believe it to be baseless.

Percentile Graphs.—One of the best ways of comparing two sets of marks is by a 7-point percentile graph, which is illustrated in the following example.


Ex. 22.—The table shows the marks in English and arithmetic of 57 Scottish pupils, aged 12 +.

              Frequencies
  Mark     English    Arithmetic
  90 +         1           5
  80 +        10          11
  70 +        15          13
  60 +        17          15
  50 +         8          10
  40 +         5           2
  30 +         1           1
              57          57

Though the tabulation is coarse, it indicates that the two distributions differ in their means and S.D.s, and that, while the first is roughly normal, the second is negatively skewed. The procedure is, first, to determine the 99th, 90th, 75th, 50th, 25th, 10th, and 1st percentiles for each set of marks. From the original lists these are found to be 90, 84, 77, 67, 59, 49, 39, and 97, 89, 81, 71, 60, 53, 35, respectively. Draw a graph, as shown in Fig. 19, where English marks are plotted along the Y axis against arithmetic marks on the X axis. Plot these seven points (X = 97, Y = 90; X = 89, Y = 84; etc.), and connect them up by a straight line or smooth curve which runs as nearly as possible through them all. This line may be extended at both ends so as to cover the total ranges of marks. From it we can now read off the mark on the English scale which corresponds to each mark on the arithmetic scale, and vice versa. Thus 41 in English is equivalent to 39 in arithmetic, 80 in English to 85 in arithmetic.

[Fig. 19.—Comparison of two sets of school marks by means of a seven-point graph. Y axis: English %; X axis: arithmetic %.]
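
The reading-off of equivalent marks can be imitated by interpolation between the seven plotted points. The sketch below is an approximation, not the graphical method itself: it joins the points by straight segments instead of a smooth curve, so its readings may differ by a mark or so from those taken off the graph (it gives about 84, where the text reads 85, for an English mark of 80). The function name `equate` is hypothetical.

```python
# Percentile points (1st, 10th, 25th, 50th, 75th, 90th, 99th), lowest first.
english = [39, 49, 59, 67, 77, 84, 90]
arithmetic = [35, 53, 60, 71, 81, 89, 97]


def equate(mark, from_points, to_points):
    """Piecewise-linear reading of the seven-point graph."""
    pairs = list(zip(from_points, to_points))
    if mark <= from_points[0]:
        # Extend the lowest segment below the plotted range.
        (x0, y0), (x1, y1) = pairs[0], pairs[1]
    elif mark >= from_points[-1]:
        # Extend the highest segment above the plotted range.
        (x0, y0), (x1, y1) = pairs[-2], pairs[-1]
    else:
        # Find the segment that contains the mark.
        for (x0, y0), (x1, y1) in zip(pairs, pairs[1:]):
            if x0 <= mark <= x1:
                break
    return y0 + (mark - x0) * (y1 - y0) / (x1 - x0)


print(round(equate(41, english, arithmetic)))   # 39
print(round(equate(80, english, arithmetic)))   # 84
```

Swapping the last two arguments converts arithmetic marks to their English equivalents in the same way.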


Often it is not necessary to determine as many points as seven in order to fix the graph. Three points, preferably the 90th, 50th, and 10th percentiles, may give sufficient accuracy. Note that just the same procedure could have been used even if the distributions had differed widely in means and ranges, e.g. if one subject had been marked out of 40, the other out of 200.
Interpretation of Marks by Means of Percentile Levels.—It would be a great step forward if examiners at universities or training centres, and even at secondary schools, were required to publish with their mark lists a table of the median mark and some of the other percentiles for the marks which they had actually awarded.¹ Lecturers and students would soon learn the significance of these figures, and would then be able to see what standards of marking the various examiners were adopting. And each student could interpret the excellence or poorness of his own marks in various subjects. The subject in which he was best would not necessarily be the one in which he obtained the highest examination mark, but the one in which his percentile level was highest. Similarly, if he obtained the same mark in two subjects, this would not show that his ability at the two was the same (as he often supposes at present), unless it happened that their percentile levels were the same.

¹ This scheme would not, however, be feasible in an examination taken by a very small number, say less than twenty students (cf. p. 79).

Defects of Percentiles.—It has sometimes been suggested that teachers and examiners should not publish their pupils' or students' marks at all, but only their percentile levels. This would, indeed, be much fairer than most present systems. The 50th percentile would always represent average achievement; percentiles approaching 100 and 1 would always represent the same very high and very low achievements. But such a plan is defective in that it would not be intelligible to interested persons, e.g. to children or to most of their parents. They would probably fail to realize that percentiles make no pretence of providing an absolute evaluation of achievements, as do ordinary sets of marks (albeit the pretence is false). Percentiles only indicate achievement relative to the particular group or groups of pupils who have been marked. In fact, they are the same as rank orders, except that they always run from 100 to 1, whereas rank positions run from 1 to N.

A more important defect is that percentiles, although invaluable for purposes of comparing distributions, may be very misleading if used for combining distributions. The reason is that percentile units are very far from equivalent to one another, since they are far closer together at or near the median than they are at or near the extremes. It can be seen from the 'probability integral' graph (Fig. 18) that a percentile level of 90, i.e. the measure above which only 10% of the measures lie, corresponds to an x/σ score of +1.28. The 80th similarly corresponds to +0.84, the 60th and 50th to +0.25 and 0.00. Thus the difference between the first two of these, 0.44, is decidedly greater than that between the second pair, 0.25. Actually the 99th, 94th, 78th, 50th, 22nd, 6th, and 1st percentiles are all approximately equally spaced, in the sense that a child who is at the 99th percentile is as superior to one at the 94th, as the 94th is to the 78th, or the 78th to the 50th. The consequence of this inequality of units is that, although we may, if we wish, average or otherwise combine a person's percentiles on different tests, we shall obtain thereby quite different results from those obtained by averaging his scores.
Ex. 23.—Pupil A's marks on two examinations lie at the 98th and 54th percentiles; Pupil B's marks on both examinations lie at the 85th percentile. A's average percentile of 76 is distinctly poorer than B's 85. But if these figures are converted into x/σ scores, i.e. into units which really are equivalent, it will be found that A is actually slightly better than B. For the x/σ values of the 98th and 54th percentiles are +2.054 and +0.100, average +1.077; whereas the x/σ value of the 85th percentile is only +1.036.

It is advisable, therefore, never to use percentiles for combining marks. The same is true of rank orders.
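
The reversal described in this example is easily reproduced. The sketch below is illustrative only; it assumes normally distributed marks, so that each percentile can be converted to an x/σ score by the inverse of the normal curve, and the names used are hypothetical.

```python
from statistics import NormalDist


def z_of_percentile(p):
    """x/sigma score below which p per cent of a normal distribution lies."""
    return NormalDist().inv_cdf(p / 100)


pupils = {"A": [98, 54], "B": [85, 85]}
mean_z = {}
for name, pcts in pupils.items():
    mean_pct = sum(pcts) / len(pcts)
    mean_z[name] = sum(z_of_percentile(p) for p in pcts) / len(pcts)
    print(f"{name}: mean percentile {mean_pct:.0f}, mean x/sigma {mean_z[name]:+.3f}")
# A: mean percentile 76, mean x/sigma +1.077
# B: mean percentile 85, mean x/sigma +1.036
```

Averaging the percentiles places B ahead; averaging the x/σ scores places A slightly ahead, which is the point of the example.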
Adoption of a Standard Scale for All School or Examination Marks.—The following solution to this difficulty may be recommended. Each chief examiner, headmaster, or other authority who is concerned with combining, comparing, and interpreting sets of marks awarded by sub-examiners, class teachers, etc., should adopt an arbitrary standard scale. He might, for instance, choose a percentage scale ranging roughly from 30% to 90%, with a mean of 60%, and a S.D. of 10. Now, on such a scale the 7 percentile points mentioned above¹ fall at 83, 73, 67, 60, 53, 47, and 37 respectively. Each set of marks submitted by a teacher or examiner should be converted into this scale by the graphical method already described. If this is done the marks from all the markers will be strictly comparable with one another, and they will be expressed in units which are (sufficiently closely for all practical purposes) equivalent to one another, so that they can be combined or averaged at will. And there is the further advantage that the scale corresponds well to the present conventional notion of high and low percentages. Naturally any other standard scale can be chosen if preferred, and its percentile points deduced from the probability integral graph. Fig. 20 shows the graphs for converting the two distributions of Ex. 22 into this standard scale. 97 for arithmetic and 90 for English are plotted against 83 (the 99th percentile point) on the standard scale, and so on. The standard mark corresponding to any other mark can then be read off from the graphs.

¹ The x/σ values of the 99th and 1st percentiles are + and − 2.326, of the 90th and 10th they are + and − 1.282, of the 75th and 25th they are + and − 0.675.

[Fig. 20.—Conversion of two sets of school marks into a standard scale. Y axis: standard scale %; X axis: English and arithmetic %.]

Another way of achieving the same result is as follows. With the aid of the probability integral graph a table is drawn up showing the proportionate frequency of each mark on the desired standard scale, using the same method as that described in Ex. 19, p. 60.
Ex. 24.—The following table shows the proportions to be expected within successive classes of 5 marks, when the mean is 60, the S.D. 10.

   Mark      F%
  88-92    0 or ½
  83-87      1
  78-82      3
  73-77      6½
  68-72     12
  63-67     17
  58-62     20
  53-57     17
  48-52     12
  43-47      6½
  38-42      3
  33-37      1
  28-32    0 or ½
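
Both the percentile points of the standard scale and the class percentages of this table can be derived from the normal curve. The sketch below is an illustration under stated assumptions: a standard scale with mean 60 and S.D. 10, and whole-number marks, so that the class 58-62 is taken to cover 57.5 to 62.5. That treatment of the class limits is an assumption made here because it yields a symmetrical table; the variable names are likewise illustrative.

```python
from statistics import NormalDist

standard = NormalDist(mu=60, sigma=10)

# Standard marks at the 99th, 90th, 75th, 50th, 25th, 10th and 1st percentiles.
points = [round(standard.inv_cdf(p / 100)) for p in (99, 90, 75, 50, 25, 10, 1)]
print(points)   # [83, 73, 67, 60, 53, 47, 37]

# Percentage of marks expected in each class of five, 88-92 down to 28-32.
table = {}
for low in range(88, 23, -5):
    high = low + 4
    share = standard.cdf(high + 0.5) - standard.cdf(low - 0.5)
    table[f"{low}-{high}"] = 100 * share
    print(f"{low}-{high}: {table[f'{low}-{high}']:4.1f}%")
```

Rounded to the nearest whole or half per cent, the computed figures agree with those tabulated.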
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Teac 2 

and mcm POD PREE may be given such a table 
Bicordance мета award marks to their pupils roughly in 
formity tes the frequencies there shown (precise con- 
examiner or om км Or, alternatively, the chief 
tions of unscal T teacher may take each of the distribu- 
and 92 to th, E marks, award а scaled mark between 88 
ind sre e | юр 0 or +% of the examinees, marks between 
Е оны. e next 1% of them, and so on. Only very 
Worse than rs or poor candidates—those better or 
given marks "и in 1,000 of the ordinary'run—should be 
distributi of over 90% or under 80%. For in a normal 

on only a proportion of 0-135% is expected to 


Score а 
more than -+ 85 or less than — 32, 
с 


Ex. 25.—Burt (1917) recommends an alternative standard scale of
marks, based on a similar plan, but with an average of 10 and a
maximum range of 0-20. The scale is given here with the
frequencies to be expected when (a) 100 and (b) 40 pupils are
marked according to it.

                  Frequency
    Mark        (a)      (b)
    20 to 16    Exceptional cases only
    15            1 }
    14            3 }      1
    13            6        3
    12           12        5
    11           18        7
    10           20        8
    9            18        7
    8            12        5
    7             6        3
    6             3 }
    5             1 }      1
    4 to 0      Exceptional cases only

    Total       100       40

Here also exact adherence should not be demanded. Any one set of
marks for 40 pupils may depart from it fairly widely. But in the
long run all the marks awarded by any one teacher or examiner
should show the same mean, S.D., and tendency to normality as
does this standard scale.
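The mean and S.D. implied by this scale can be worked out directly from the frequencies. This is a small sketch, assuming the column (a) frequencies for marks 15 down to 5 are 1, 3, 6, 12, 18, 20, 18, 12, 6, 3, 1 (they sum to 100). The helper name `mean_and_sd` is illustrative.

```python
from math import sqrt

# Frequencies per 100 pupils for marks 5 to 15 (column (a) of the scale).
FREQ_100 = {15: 1, 14: 3, 13: 6, 12: 12, 11: 18, 10: 20,
            9: 18, 8: 12, 7: 6, 6: 3, 5: 1}


def mean_and_sd(freq):
    """Return (number of pupils, mean mark, S.D.) for a frequency table."""
    n = sum(freq.values())
    mean = sum(mark * f for mark, f in freq.items()) / n
    variance = sum(f * (mark - mean) ** 2 for mark, f in freq.items()) / n
    return n, mean, sqrt(variance)


if __name__ == "__main__":
    n, mean, sd = mean_and_sd(FREQ_100)
    print(f"pupils={n}, mean={mean:.1f}, S.D.={sd:.2f}")
```

On these figures the scale has a mean of 10 and an S.D. of a little under 2 marks, so the ordinary run of marks (5 to 15) covers about two and a half S.D.s on either side of the mean.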


OBJECTIONS TO RATIONALIZED MARKING 


There is no doubt that many teachers and examiners will 
object strongly to the principle upon which this chapter is 
based, that the average and the dispersion of sets of marks 
should always be the same in any one institution. Facul- 
ties in a university which consistently award high marks 
sometimes try to justify themselves by claiming that they 
get a ‘better run’ of students than do those which consistently
award low marks. Examiners regard it as only natural to award
different average marks to the answers to different questions in
one paper, since some questions are ‘done better’ than others.
Teachers similarly regard their classes as ‘better at’ some
subjects than at others and so deserving of higher marks. Again,
the class they have this year may seem to them an exceptionally
‘good’ or ‘bad’ one, and the compositions written by the class
may be ‘better or poorer than usual.’ Finally, those mathematical
lecturers and teachers who are particularly prone to award very
widely dispersed marks state that their students exhibit a ‘wider
range of accomplishment’ than do students of other subjects.

Justification for Equalizing the Average in All Sets of 
Marks.—Many of these claims are extremely dubious. Thus it is
more likely that the university departments which award high
marks will attract a poorer, rather than a better, run of
students, and that the subjects which are marked more strictly
will be pursued chiefly by the better students who care more for
the subjects than for marks. Again, one is justified in
suspecting teachers' judgments
about the merits of their classes, since some teachers appear,
year after year, to be blessed with good classes, while others
are cursed with bad ones. According to what is popularly known as
‘the law of averages,’ one would expect the number of good pupils
to be balanced in the long run by an equal number of bad ones. As
a matter of fact the statistical techniques outlined in Chap. V
give us a means of predicting just how much classes are likely to
vary from year to year, or English compositions from one occasion
to the next, owing to so-called fluctuations of sampling. Here is
an instance. A class of 45 pupils averages 60% in general
scholastic work on a scale of marks running roughly from 30% to
90%. From these data it may be predicted that the odds are even
that next year's class will show an average level between 59% and
61%, and that it is highly improbable that next year's average
will differ from this year's by more than 3% or 4%. Consider also
the marking of English compositions, or of half a dozen or more
separate questions in an examination. If the scale of marks
employed ranges roughly [...] (S.D. 8), then we can predict that
the maximum difference likely to occur in the average marks for
different compositions or questions is 3 marks, provided there
are at least 86 scripts, and only 2 marks if there are [...].
[...] The average level of a group of pupils is far more stable
than is usually supposed. The apparent variations in the quality
of answers are due much less to variations in the pupils than to
irregularities in the suitability of the questions or of the
topics for compositions. But if an examiner sets a bad question
which is beyond the scope of the majority of examinees, or which
does not allow of their showing their fullest capacities, there
is no justification for giving low marks. It is, therefore, much
fairer to give the average mark to the actual average script in
all cases, as suggested in Chap. I.
It is a different matter when a group of pupils or students is
known, from objective evidence, to be selected in such a way as
to be, on the average, superior or inferior to the mean of some
larger group. A special class for scholarship candidates and one
for backward pupils will naturally show generally good and poor
attainments respectively. But in most cases the teacher or
examiner who claims to have a specially good or poor set of
pupils or scripts is judging purely on the basis of subjective
recollections of what he regards as average level. When a test or
examination is applied to all the pupils of a certain age in a
school, or in several schools, then the whole group should be
marked according to the principles already outlined, and the
marks for separate classes within this wider group may be picked
out and averaged. Such averages will then quite justifiably show
a certain range of variation. But when a test or examination is
given only to a single class, it is far safer not to trust the
marker's subjective view of the class's general level, but to
award an average mark to the average script actually obtained
from this class. Similarly the averages for different school
subjects should be the same, regardless of whether a class
appears to be better at one subject than at another, except when
objective evidence (derived from the application of the same
tests or examinations to other, larger, groups of pupils) proves
that the class's attainments are uneven.

The ‘Goodness’ or ‘Badness’ of a Mark.—We still have to face the
wider question: what do teachers and examiners mean when they say
that a pupil's work is good or bad, or an examination paper easy
or difficult? For endless objections of the type discussed above
will be raised until this is settled. Most persons who have
marking to do obviously believe that they possess in their minds
some absolute standard of goodness and badness by means of which
they can evaluate a pupil's or a class's achievement. But
numerous experiments have proved that such subjective judgments
of merit or difficulty, when obtained from several different
persons, are often widely discrepant. It was for this reason that
we were forced to conclude, in Chap. II, that all marking
consists primarily in ranking a set of performances in order, and
that it is a matter of relative comparison, not of absolute
placement. The goodness of a child's performance at some task can
therefore mean only that he ranks high when his performance is
directly compared with the performances of other children at the
same task, under similar circumstances. An achievement is good if
only very few people can do as well or better, and it is poor if
only very few people do it as badly or worse. Similarly, the
rational definition of the difficulty of a task is the fewness of
people who can accomplish this task.

‘Fewness of people,’ in statistical terminology, connotes either
percentile level or, preferably, σ value. A mark is not
necessarily a good one because it happens to be 80% or 9/10 and
the like. But it is good relative to the marks of some particular
group of persons if it falls at the 95th percentile, or is 2σ
above the mean mark obtained by these persons, or if it stands
high on a standard scale whose mean and S.D. are known. Such
designations of goodness or badness are entirely unequivocal.
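The point that the same mark may be good in one group and poor in another can be illustrated numerically. The sketch below is hypothetical throughout: the two sets of marks are invented, the function name `standing` is illustrative, and the mid-rank convention for tied marks is one common choice, not something the text prescribes.

```python
from statistics import mean, pstdev


def standing(mark, group_marks):
    """Return (percentile rank, sigma score) of `mark` within `group_marks`."""
    n = len(group_marks)
    below = sum(1 for m in group_marks if m < mark)
    ties = sum(1 for m in group_marks if m == mark)
    # Mid-rank convention: half of the tied marks are counted as lower.
    percentile = 100 * (below + 0.5 * ties) / n
    sigma = (mark - mean(group_marks)) / pstdev(group_marks)
    return percentile, sigma


if __name__ == "__main__":
    # Invented marks from a severe marker and from a lenient marker.
    severe = [40, 45, 50, 50, 55, 60, 60, 65, 70, 80]
    lenient = [70, 75, 80, 85, 85, 90, 90, 95, 95, 100]
    for label, group in (("severe", severe), ("lenient", lenient)):
        pct, sig = standing(80, group)
        print(f"80% under the {label} marker: percentile {pct:.0f}, sigma {sig:+.2f}")
```

Under the severe marker a mark of 80% stands at the 95th percentile, nearly 2σ above the mean; under the lenient marker the identical figure falls in the bottom quarter.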
The Ease or Difficulty of Questions.—By approaching marking from
this standpoint we obtain a rational scale, not only of the
goodness or badness of pupils' work, but also of the difficulty
of questions or test items. If, for example, five items are
answered correctly by 16%, 23%, 31%, 40%, and 50% of a large
group of persons, then we can state that (for these persons) the
items are all equally spaced in difficulty. For these proportions
correspond to σ values of + 1.0, + 0.75, + 0.50, + 0.25, and 0
respectively. Similarly some questions which are below average
difficulty will be equally spaced if answered by, say, 91%, 94%,
and 96% of pupils, for here the proportions of incorrect answers,
namely 9%, 6%, and 4%, correspond to σ values of − 1.35, − 1.55,
and − 1.75, respectively.
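The σ values quoted for these items can be recovered from the inverse of the normal probability integral. This is a minimal sketch, assuming that ability is normally distributed in the group and that an item's difficulty is the normal deviate which cuts off the passing proportion in the upper tail. The function name is illustrative.

```python
from statistics import NormalDist


def difficulty_sigma(proportion_correct):
    """Sigma value of an item answered correctly by the given proportion."""
    # The passers are taken to be the top `proportion_correct` of the curve.
    return NormalDist().inv_cdf(1 - proportion_correct)


if __name__ == "__main__":
    for p in (0.16, 0.23, 0.31, 0.40, 0.50, 0.91, 0.94, 0.96):
        print(f"{p:.0%} correct -> sigma {difficulty_sigma(p):+.2f}")
```

The harder items come out about a quarter of a σ apart and the easier ones about a fifth of a σ apart, although the percentages themselves are very unequally spaced.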
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Justification for Equalizing the Dispersion in all sets of 
Marks.—The claim that mathematical accomplishments vary more
widely than do accomplishments in other subjects requires some
further consideration. It is true that it is usually easier to
discriminate between different levels of mathematical than of,
say, English composition, ability, because tests in the former
can generally be marked more objectively and accurately. Thus in
the work of an ordinary school class it may be impossible to
distinguish more than about ten different levels of essay
writing, whereas perhaps a hundred levels of mathematical skill
may be distinguished. Although there is some justification for
attaching greater weight to marks obtained by the more efficient
measuring instrument, yet this greater efficiency is no criterion
of the range of ability, as the following analogy shows. In a
school quarter-mile race, one referee may time the boys with a
seconds' hand watch, and find that some take 50, others 51,
52, . . . 60 seconds. Another referee may possess a
tenth-of-a-second stopwatch and find that some boys take 50.0,
others 50.1, 50.2 . . . 59.9, 60.0 seconds. But the range of
ability is the same in both cases.

There is only one way in which the claim might be scientifically
substantiated. All the pupils of a certain age attending both the
best, the medium, and the poorest schools in a district should be
given tests, preferably of the objective type, in arithmetic and
English, and their scores should be tabulated. On studying the
scores in any one school it would doubtless be found that the
ranges and S.D.s on both tests were somewhat smaller than they
were among the whole group of pupils, because the pupils in one
school are always more homogeneous in their abilities than are
the population at large. If the increase in homogeneity (i.e. the
decrease in S.D.) turned out to be greater for English than for
arithmetic, then the claim would be proved. But, so far as the
present writer is aware, no experiment of this type has ever
indicated a tendency for the range of mathematical abilities in a
group of pupils or students to be more diverse or spread out,
relative to the range in the population at large, than the range
of English or other abilities. Probably then the claim is based
(a) on the greater efficiency of the instruments for measuring
mathematical ability, (b) on the assumption that there exists an
absolute criterion of high and low ability. The former point is,
as we have seen, irrelevant, and the latter fallacious.¹

The Marking of Small Groups.—Peculiar difficulties still remain
in applying the principles and techniques of this chapter to the
marking of small groups. The mean and the S.D. of marks among,
say, half a dozen Honours students may vary far more from year to
year than they do among 40 or 100 students. If the average is
taken as 60% this year, the odds are even that it will lie
somewhere between 57 and 63 next year, but it may drop as low as
about 50%, or rise as high as 70%. And if there is only one
candidate who is awarded 60% this year, a single candidate next
year may deserve anywhere from 30% to 90%. Until such time as
objective tests are available, standardized on Honours students
from many universities, or over several years, the examiner can
only employ subjective marking scales, as indicated in Chap. II
(p. 42).
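These limits follow from the standard error of a mean, which shrinks with the square root of the number of candidates. The sketch below assumes an S.D. of 10 marks (as on the standard scale with mean 60), takes the even-odds limits as 0.6745 standard errors either side of the mean, and uses three standard errors as an outer limit. The function name and the choice of outer limit are illustrative assumptions, not the author's stated procedure.

```python
from math import sqrt

# Half of a normal distribution lies within 0.6745 standard errors of its centre.
PROBABLE_ERROR = 0.6745


def mean_limits(mean, sd, n):
    """Return (standard error, even-odds limits, 3 S.E. limits) for a group mean."""
    se = sd / sqrt(n)
    even_odds = (mean - PROBABLE_ERROR * se, mean + PROBABLE_ERROR * se)
    outer = (mean - 3 * se, mean + 3 * se)
    return se, even_odds, outer


if __name__ == "__main__":
    for n in (1, 6, 45, 100):
        se, even, outer = mean_limits(60, 10, n)
        print(f"n={n:3d}: S.E.={se:5.2f}, even odds {even[0]:.0f}-{even[1]:.0f}, "
              f"outer limits {outer[0]:.0f}-{outer[1]:.0f}")
```

With six candidates the even-odds limits come to about 57 and 63, and with a single candidate the outer limits are 30 and 90, in line with the figures quoted. For a class of 45 the even-odds limits narrow to about 59 and 61.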


TEST NORMS

Test norms are tables of the scores of large groups of children
or adults by reference to which it is possible to interpret the
significance of the score obtained by any person, or group of
persons, to whom the test is applied. The mental tester fully
admits that a test performance, as such, tells us nothing as to
its goodness or badness, and that it must be compared with the
performances of other similar persons tested under the same
conditions. Thus some tests are standardized, and norms are
obtained, by being applied to large groups of typical 11-year
English elementary school children, or to Scottish qualifying
examination candidates; others are given to school leavers aged
14 to 15, or to central school pupils; others to normal adults,
or to university students, and so forth. American testers often
publish grade norms, i.e. the standards obtained by successive
school classes in their elementary and high schools. But they
realize that the standards in different schools cannot possibly
be uniform. In this country the variations are so great that the
system has scarcely ever been adopted.

¹ It is only fair to mention, however, that Thomson (1930)
suggests that greater variability would be found in mathematical
than in other abilities, if it was possible to measure
variability absolutely.

A mere statement of the average or median scores of such groups
is, of course, insufficient. Often the deciles are quoted, so
that it is possible to state that a person's performance is, say,
as good as that of the best 20% of the group upon which the test
was standardized. A system based on σ values of the various
scores would be better because, as was shown above, percentile
units are so unequal to one another. Such a system has seldom
been used, presumably because it demands more statistical
knowledge than most untrained testers possess. Often, however,
the mean and the S.D. of the scores obtained by the
standardization group are available, so that the tester can
readily determine how many times σ above or below the mean are
the scores of his testees, and so can interpret the goodness or
badness of their scores.

It should by now be obvious that the usefulness of such norms is
strictly limited by the nature of the standardization group.
Norms are purely and simply a summary of the test performances of
a particular group of testees, and must be interpreted with
reference to that group. For example, if an American educational
test is given to a British child and his score falls at the 70th
percentile according to the norms published with the test, this
means that he did better than 70% of the American children who
took the test, but does not necessarily show that he is above the
British median or average. In point of fact American educational
tests are almost useless in this country because their standards
differ so greatly from ours (cf. McGregor, 1934), and, of course,
because their content and techniques are often unsuitable.
Probably the same is true of intelligence tests, but not of the
Stanford-Binet (cf. Chap. IX). Even between England and Scotland
the differences are so big that few tests which have been
standardized in the one country can be used in the other unless
fresh norms have been collected there. The [...] efficiency of
Scottish teaching methods raises the attainments of Scottish
elementary school children on the 3 R's by anything from 10% to
50% over the English level. Such evidence as we possess, however,
does not indicate appreciable differences on tests of general
intelligence. Some tests have been adequately standardized in
London or in Glasgow, but are not appropriate for use in the
provinces or in rural areas where the educational level, and
possibly also the intelligence level, are likely to be different.
Probably it would be better to provide more than one set of norms
for such different types of school, so that a child's test
performance could be referred to the most suitable set.
Apparently even the size of school has some effect upon
educational level (cf. Mowat, 1938). Perhaps still more important
is socio-economic status, since this is known to have a
pronounced effect upon educational and intellectual levels.¹ Thus
if a test has been standardized by being applied only to children
in a rather poor area, its norms will not be suitable for schools
in a relatively superior area. Norms for, say, children aged 11 +
would be satisfactory only if all types of schools containing
pupils of this age were adequately represented in the
standardization group. Norms derived only from one type of school
may still be of some use, but the tester must then remember to
interpret his testees' results with reference to this type.

The origin and the adequacy of many of the published test norms
are so dubious that we would advise testers in schools to treat
them all with caution, and when possible to do without them. Once
a whole school class, or all the children of a certain age in the
school, have been tested, each child's standing relative to the
rest of this group will be known, and will provide most of the
information needed for educational diagnosis and prognosis,
without reference to any outside standards. And if the same tests
can be applied in a school for several years, an extremely useful
set of ‘local’ norms could quite easily be collected, either in
terms of percentiles or of σ scores.

¹ [...] are probably partly [...] hereditary. Poor surroundings
and upbringing [...] influences upon education. In addition there
is indisputable evidence of a connection, albeit [...] and social
class, quite apart from [...].
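Collecting such local norms amounts to pooling the scores from successive years and summarizing them. The following is a rough sketch only: the scores and year labels are invented, the function name `local_norms` is illustrative, and the decile cut points use the default method of Python's `statistics.quantiles`, which is one of several conventions.

```python
from statistics import mean, pstdev, quantiles


def local_norms(scores_by_year):
    """Pool several years of scores on one test and summarize them."""
    pooled = [score for year in scores_by_year.values() for score in year]
    return {
        "n": len(pooled),
        "mean": mean(pooled),
        "sd": pstdev(pooled),
        # Nine cut points dividing the pooled scores into ten equal groups.
        "deciles": quantiles(pooled, n=10),
    }


def sigma_score(score, norms):
    """Express a child's score in S.D. units relative to the local norms."""
    return (score - norms["mean"]) / norms["sd"]


if __name__ == "__main__":
    # Invented raw scores for two successive years of the same test.
    years = {"first year": list(range(20, 60)), "second year": list(range(25, 65))}
    norms = local_norms(years)
    print(norms["n"], norms["mean"], round(norms["sd"], 1))
    print([round(d, 1) for d in norms["deciles"]])
    print(round(sigma_score(55, norms), 2))
```

Each further year of testing can be added to the dictionary, so that the norms rest on a steadily larger group.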


AGE NORMS


Age norms may be defined as follows. A child who obtains a Mental
Age or M.A. (on an intelligence test), or an Educational Age or
E.A. (on an educational test) of x years is one whose performance
is as good as that of the average child of Chronological Age or
C.A. of x years. For example, a child aged 10.0 may do as well in
an intelligence test as an average 11.0 year old, but only score
up to the level of a 9.0 year old on tests of reading and
arithmetic. He is then said to possess C.A. 10.0, M.A. 11.0,
Reading and Arithmetic Ages of 9.0. At first sight this system of
expressing test results is far simpler than the percentile or σ
systems. It provides a uniform scale of units by means of which a
child's relative performances on any number of tests can be
compared. His strong or weak points in different school subjects,
or even in particular aspects of school subjects (e.g. the main
arithmetical skills), and his superiority or inferiority on
verbal and non-verbal intelligence tests, etc., can be seen at a
glance from a table or graph of his various E.A.s and M.A.s.
Nevertheless, there are many serious weaknesses in this type of
norm, which we shall have to describe in detail.

Difficulties in Establishing Age Norms.—First there are almost
insuperable difficulties in collecting standardization groups
which will be truly representative of all the children of a
certain age. Between 6 and 11 years the elementary schools will
provide groups whose average scores are probably close to the
averages in the total population. Any one age group will, of
course, be scattered through a large number of different school
classes. But the range of ability in elementary schools is
restricted, since some of the brightest and dullest sections of
the population attend private and special schools, respectively.
And, as we have seen, the range or dispersion of a distribution
is quite as important as its average. Beyond 11 to 12 years, when
elementary school pupils proceed to various types of schools, it
becomes increasingly difficult to secure samples whose averages,
let alone ranges, are likely to be typical. In a very few
instances results have been obtained from practically all the
children in a given district,¹ or from all the children born on
certain dates.² But often the best that can be done is to control
the socio-economic factor. Suppose that a district contains a
hundred schools, and that a tester wishes to standardize a test
on pupils in ten of them, he should see that the schools he
selects cover the whole range of socio-economic level, in due
proportion. An extension of this method was used by Cattell
(1934) in establishing adult norms for an intelligence test. His
testees were classified according to their occupations, and the
numbers at each main occupational level were compared with the
corresponding numbers in the total population. Since he had an
unduly high proportion of professionals, and too low a proportion
of labourers, he reduced the weight of the former and increased
that of the latter group before calculating the mean and the S.D.
of test scores for his combined groups.
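The reweighting described here can be sketched as follows. This is an illustration of the general idea only, not a reconstruction of Cattell's actual procedure: the occupational levels, scores, and population shares are invented, and weighting each person by the ratio of population share to sample share is one standard way of doing it.

```python
from math import sqrt


def weighted_mean_sd(groups, population_share):
    """Mean and S.D. of scores after reweighting groups to population shares.

    groups: {level: [scores]}; population_share: {level: share of population}.
    """
    total_n = sum(len(scores) for scores in groups.values())
    weighted = []
    for level, scores in groups.items():
        sample_share = len(scores) / total_n
        # Weight is below 1 for over-represented levels, above 1 otherwise.
        weight = population_share[level] / sample_share
        weighted.extend((score, weight) for score in scores)
    weight_sum = sum(w for _, w in weighted)
    mean = sum(s * w for s, w in weighted) / weight_sum
    variance = sum(w * (s - mean) ** 2 for s, w in weighted) / weight_sum
    return mean, sqrt(variance)


if __name__ == "__main__":
    # Invented sample: too many professionals, too few labourers.
    sample = {"professional": [120, 130, 125, 135], "labourer": [90, 100]}
    shares = {"professional": 0.2, "labourer": 0.8}
    mean, sd = weighted_mean_sd(sample, shares)
    print(f"weighted mean {mean:.1f}, weighted S.D. {sd:.1f}")
```

In this invented sample the unweighted mean is about 116.7, whereas the weighted mean is 101.5, which shows how strongly an unrepresentative sample can distort the norms.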
The same difficulty in obtaining representative samples hinders
us from tracing accurately the course of intellectual and
educational growth from childhood, through adolescence, to
adulthood. The available evidence, however, shows fairly
definitely that this growth is not entirely regular, which means
that M.A. and E.A. units are not equivalent throughout. The year
by year increase of intelligence in an average person seems to be
reasonably constant from about 3 to 10 years, after which the
rate of increase diminishes, and the M.A. units become
progressively smaller, until a constant level is reached
somewhere in the neighbourhood of 15 years. Probably this upper
limit is reached at different ages on different tests, e.g. an
average adolescent may show no further improvement after about 12
on a simple performance test, but continue to advance slightly on
complex group verbal tests even up to 17 or 18. The upper limit
for a very dull child is, of course, far lower; he may never
surpass the performance of an average 10 year old on any test. It
is not yet certain whether he reaches this limit at the same C.A.
as an average child reaches his, or somewhat earlier. A very
bright child, on the other hand, naturally improves well above
the normal 15-year level, and so we express his intelligence as
M.A. 16, 17, . . . up to 22 or even higher. But these units are
entirely artificial. A 20-year M.A. does not mean that a testee's
performance is equal to that of an average person aged 20, but
only that his performance is a good deal better than the
performance of average persons aged anything from 15 years
upwards. We must conclude, then, that M.A. units are so complex,
and likely to be so uneven above about 12 years, that it would be
far better to discard them altogether at such levels.

¹ Cf. Fraser Roberts, et al. (1935).
² Cf. Scottish Council for Research in Education [...], and the
recent survey of intelligence of [...] Scottish children carried
out by the Research Council (Macmeeken, [...]).

Lack of Equivalence of M.A. and E.A. Units.—Still less is known
about E.A. units. Possibly the increase is fairly steady from
about 6 to 11 years, but then there is a strong tendency in most
schools to force children unduly, in order that they may pass the
special place or qualifying examinations. Thereafter some
relaxation is allowed. Hence the 11 to 12-year unit is likely to
be exceptionally large and the 12 to 13-year unit exceptionally
small. The units may then tail off among those who receive no
higher education, but among secondary school pupils and
university students there is likely to be progress well into
adulthood. Clearly then E.A. and M.A. units do not provide
comparable scales much above 10 years. A child of 13 with an M.A.
and an E.A. of 14 is not equally superior in intelligence and in
education, since 14 minus 13 is a bigger quantity in the latter
than in the former. Nor, of course, is this superiority of 1 year
equivalent to the superiority of a child of 7 whose M.A. and E.A.
are 8 years.

Not long after Binet put forward the M.A. system of scaling
intelligence, the discovery was made that a child's relative
advancement or retardation do not remain constant, but increase
as his age increases. Burt (1917) proved that the same was true
of E.A. A child of 5 with an M.A. or E.A. of 6 will usually show
an M.A. or E.A. of 12, not of 11, when he reaches C.A. 10. In
other words, a superiority of a year in a young child is much
greater than the same superiority in an older child. But the
ratio of M.A. to C.A., or E.A. to C.A., does seem to remain
approximately constant, hence the I.Q. and E.Q. have been very
widely adopted as measures of intellectual and educational
advancement and retardation.
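The arithmetic of the quotient can be written out in a few lines. This is a minimal sketch, assuming the ratio of test age to chronological age stays exactly constant, which the text describes only as approximately true. The function names are illustrative.

```python
def quotient(test_age, chronological_age):
    """I.Q. or E.Q.: the test age as a percentage of the chronological age."""
    return 100 * test_age / chronological_age


def projected_test_age(test_age, chronological_age, later_age):
    """Test age expected at `later_age` if the quotient stays constant."""
    return later_age * test_age / chronological_age


if __name__ == "__main__":
    print(quotient(6, 5))                # child of 5 with a test age of 6
    print(projected_test_age(6, 5, 10))  # the same child at C.A. 10
    print(quotient(11, 10))              # a one-year lead at C.A. 10
```

A one-year lead at 5 corresponds to a quotient of 120 and grows to a two-year lead at 10, whereas a one-year lead first appearing at 10 corresponds to a quotient of only 110.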

Ambiguity in I.Q. Units.—There is yet another flaw both in M.A.
and I.Q. and in the corresponding educational units, namely that
the range and S.D. of scores expressed in these units are wider
for some tests than for others. For example, the application of
educational tests to a class of 9 year olds might show that their
average E.A. was 9, and that 20% of them had E.A.s of 10 or more.
But the application of a group intelligence test to the same
pupils might show an average M.A. of 9 and as many as 30% with
M.A.s of 10 or more. If this was so, then clearly an advancement
of 1 year, or an I.Q. or E.Q. of 111 (i.e. 10/9 × 100) does not
mean the same thing on different tests. Some intelligence tests
yield I.Q.s with a S.D. of 15 or even less, others yield I.Q.s
with a S.D. as high as 25 or more. Probably educational tests
show similar irregularities.

The S.D. of Stanford-Binet I.Q.s seems to be established now as
16½, though for long it was regarded as 15. An ordinary school
class, tested with this test, will commonly obtain a range of
I.Q.s from + 2σ to − 2σ, i.e. from 133 to 67. But the same class
tested with a group test whose S.D. is 25 will range in I.Q. from
150 to 50. To take another example: a class of very bright
children may have an average I.Q. of 116½ according to the Binet
test. The same class tested with a group test may have an average
I.Q. of 125. The following table illustrates the same phenomenon;
it shows the proportions to be expected in the total population
with I.Q.s of 125 or more, or 75 or less, on tests with various
S.D.s.

    S.D.      % with I.Q. of 125 or more, or of 75 or less
    15         5
    16½        6½
    20        11
    25        16

Unfortunately it is impossible to assert that any particular
value of the S.D. of I.Q.s is the ‘right’ one, and that the other
values are too large or too small. The value depends on certain
features of the test which are still somewhat obscure, and are
too complex for discussion here.¹ The figure 15 is implied when
we talk of 70 as the borderline for mental deficiency, since most
of the testing of borderline cases has been carried out with the
Binet test whose S.D. was believed to be 15. And because this
figure has been so widely accepted, some group intelligence tests
have been constructed in such a way that their I.Q.s yield this
same S.D.² But there is only one really satisfactory solution to
this problem, and that is for testers and others to give up using
I.Q.s in the present loose fashion, as though they represent a
standard scale of measurement which is independent of the
particular test employed. Actually, like all other test scores,
they can only be correctly interpreted if their S.D. is known, in
addition to their average of 100. As we shall see in Chap. X,
I.Q.s are open to numerous additional misunderstandings.

¹ Certain explanations are put forward by the writer elsewhere,
cf. Vernon (1937b). But it seems likely that the main factor, not
mentioned in that article, is the degree of heterogeneity among
the items or sub-tests. The more homogeneous the test material,
the higher will be the S.D. of the test I.Q.s. Hence the Binet
test and ‘omnibus’ group tests seem to have lower S.D.s than
tests which are composed only of a few separate sub-tests.

² Namely the Moray House Tests. Cf. Thomson (1932).

To conclude: the measurement of abilities in age units and
quotients is simple and convenient, but liable to great
inaccuracies beyond the age of 11 years. A far better system
would be to establish the mean test scores of successive age
groups, and then to express superiority or inferiority to the
means not in terms of M.A. or E.A. but by σ scores.¹ Real
accuracy would be obtained if we could say that a child of 10 was
1σ superior to average 10 year olds on the Binet test, ½σ on
performance tests, 1½σ on reading tests, and so on. This system,
moreover, would be applicable at all ages. Its only disadvantage
is that testers would have to have some statistical insight in
order to understand and use it.

¹ Possibly a scale ranging from + 2.5 to − 2.5 (or + 3 to − 3)
is a little inconvenient. Hence American testers often transfer
such scores into a scale with a mean of 50 and a S.D. of 10 or 20.
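The σ score and its transfer to a more convenient scale are both simple linear conversions. The sketch below is illustrative: the raw score and the age-group mean and S.D. in the example are invented, and the function names are not from the text.

```python
def sigma_score(raw_score, age_mean, age_sd):
    """Superiority or inferiority to the age-group mean, in S.D. units."""
    return (raw_score - age_mean) / age_sd


def rescale(sigma, mean=50.0, sd=10.0):
    """Transfer a sigma score to a scale with the chosen mean and S.D."""
    return mean + sd * sigma


if __name__ == "__main__":
    # Invented norms: 10 year olds average 28 on a reading test, S.D. 6.
    z = sigma_score(37, 28, 6)
    print(z)                    # 1.5 sigma above average 10 year olds
    print(rescale(z))           # on a scale with mean 50, S.D. 10
    print(rescale(z, sd=20.0))  # on a scale with mean 50, S.D. 20
```

With an S.D. of 20 the range from − 2.5σ to + 2.5σ maps onto 0 to 100, which may be why that variant is used; the choice of scale does not alter any child's relative standing.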


CHAPTER V 
RELIABILITY AND THE THEORY OF SAMPLING 


Untrustworthiness of Results obtained from Small Numbers of
Persons.—This chapter is likely, we must admit, to cause
considerable difficulty to readers unfamiliar with statistical
concepts. It introduces a point of view with regard to
educational measurements and experiments which may at first seem
very complex and involved, yet which is absolutely essential to a
proper understanding of scientific education and psychology.

The scientific investigator naturally wishes to arrive at results
which are true for people in general. He wants to prove that a
certain method of teaching gives superior results with all
children, that the correlation² between intelligence tests or
school examinations and subsequent scholastic achievement always
reaches a certain figure, or he wants to know the average test
score of all persons brought up in a particular environment, e.g.
of rural as contrasted with urban children, or Scottish as
contrasted with English, and so on. But he can seldom, if ever,
measure all the persons to whom his conclusions are supposed to
apply. He is forced to work with, as it were, a sample of the
population, and he is therefore entitled to regard his result
(whether it be an average, or a difference between two averages,
or a correlation coefficient, etc.) only as a sample result,
which might alter to some extent if he repeated the work on
another sample of the population.

He must, of course, take precautions in the first place to see
that his sample is not what we call a biased one. Supposing his
problem is the securing of norms for a group intelligence test,
and he wishes to find the average and the dispersion of scores
among children aged 11 +, he must, as we saw in the previous
chapter, avoid choosing only slum schools or only private
schools. But even when such bias in selection is eliminated, he
would still have to content himself with a limited sample of the
children aged 11 + in the country as a whole, and the average of
this sample might differ to some extent from the true average.
Similarly if his problem was the comparison of the intelligence
of slum and private school children, he could not measure all the
schools of these types. Thus the difference which he obtained
would only be an approximation to the difference that might
result from much more complete investigations. Here are two
concrete instances.

¹ A particularly clear explanation has been given by Thouless
(1939b).
² The reader who is unfamiliar with the conception of correlation
may be advised to study pp. 105-8 of the next chapter before
continuing with this chapter.

Ex. 26. Twelve of the writer's training college students
were given an American objective test of reading ability,
namely the Nelson-Denny test, which measures knowledge of
word meanings and comprehension of paragraphs. Their
scores, arranged in order, were: 140, 136, 135, 133, 126, 120,
109, 108, 98, 89, 87, 82.

Now, most of these scores are well above the average for
American college students, namely 96. Their average, 113.6,
falls at the 78th American percentile. But obviously the
scores are very variable. The question is, then, are we or are
we not justified in deducing from them that Scottish training
college students possess higher reading ability than American
college students? Obviously no such deduction could be made
from the score of only a single student, but does the average of
a sample of 12 provide a much surer basis? The 12 were
picked at random from among the writer's 264 students.
Thus he was careful to avoid selecting only those, or a
preponderance of those, who had an Honours degree in English.
And men and women were equally represented, since it was
conceivable that a sample composed only of women or only of
men would be biased. Yet it is quite possible that he
unintentionally picked an undue number of clever students, and
that if he picked another dozen, their scores might be mostly
below 100, and their average no higher, or even lower, than the
American figure. How liable then is such an average of twelve
measures to vary?

The question can be answered in this instance, since the same
test was given to 21 other sets of 12 students, also chosen at
random. The 22 averages range from 103.6 to 126.8, their
distribution being shown in the following table. As we
suspected, the averages are rather variable, though less so than
the separate scores, which range all the way from 62 to 160.
And though our first average of 113.6 happens to be fairly close
to the grand average of 114.9, yet we might, in the luck of the
draw, have obtained an average as much as 10 points above
or below this figure.


Average Mark      Frequency

   124 +              1
   121 +              3
   118 +              2
   115 +              3
   112 +              8
   109 +              2
   106 +              2
   103 +              1

                     22
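
The behaviour described in Ex. 26 can be imitated with a small simulation. The sketch below is illustrative only: it assumes a hypothetical normal population whose mean (114.9) and S.D. (20.15) are the figures quoted for the 264 students, draws 22 random sets of 12 scores, and compares the spread of the 22 averages with the spread of the separate scores. The function name and the fixed seed are assumptions made for the example and are not part of the original data.

```python
import random
from statistics import mean, pstdev

def simulate_sample_means(n_samples=22, sample_size=12,
                          pop_mean=114.9, pop_sd=20.15, seed=1):
    """Draw n_samples random samples; return (all_scores, sample_means)."""
    rng = random.Random(seed)  # fixed seed so that the run is repeatable
    all_scores, sample_means = [], []
    for _ in range(n_samples):
        # One hypothetical set of twelve scores from the assumed population.
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(sample_size)]
        all_scores.extend(sample)
        sample_means.append(mean(sample))
    return all_scores, sample_means

scores, means = simulate_sample_means()
print("S.D. of the separate scores:", round(pstdev(scores), 2))
print("S.D. of the 22 averages:    ", round(pstdev(means), 2))
print("Range of the averages:", round(min(means), 1), "to", round(max(means), 1))
```

The averages should scatter much less widely than the separate scores, though still by several points, which is the pattern shown in the table.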


Ex. 27. The correlation coefficient was calculated between
the marks of 25 students on an examination given near the
beginning of the year, and their marks on another paper at the
end of the year. The coefficient was + 0.58. But 25 students
is only a very small sample, and similar results calculated
for 13 other similar samples varied widely, as shown in the
following table.

Correlation Coefficients      Frequency

+ 0.60 to + 0.69
+ 0.50 to + 0.59
+ 0.40 to + 0.49
+ 0.30 to + 0.39
+ 0.20 to + 0.29
+ 0.10 to + 0.19
  0.00 to + 0.09
- 0.10 to - 0.01

[The frequency column is illegible in the source.]

Thus the original figure is far from trustworthy, and even
the average coefficient of + 0.315, for 350 students, though
probably much more reliable, might also alter to some extent
if the same examinations were compared among other similar
large groups of students.

The Scientific Standpoint. The scientific investigator,
therefore, always looks on any numerical result as a sample
one which may deviate more or less from the truth. Even
when he has excluded errors of selection such as would
make the persons he is measuring in any way unrepresentative,
he has to consider the probable extent of these
so-called chance errors of sampling. He therefore goes on
to imagine that it is theoretically, though perhaps not
practically, possible to obtain a very large number of other
such samples. The two tables given above are instances
on a small scale. They both bring out a vital step in our
argument, namely that such additional samples tend to
conform to normal distributions. The numbers involved
are small, hence these distributions are irregular, but there
exists ample evidence to show that larger numbers would
eventually yield smooth normal curves.

Let us take a fresh instance, where only a single sample
result is available, in order to illustrate the further stages
of the argument.

Ex. 28. The average scores for 136 men and 114 women
students on the Cattell Intelligence Test III (cf. Ex. 8, p. 12)
were 101.55 and 99.65 respectively. Thus the men were 1.90
points superior.

Can we deduce that men students, of the type tested, are
in general better than women, or would we, if we applied
the same tests to many other similar groups, find the difference
to be larger in some cases, negligibly small in other
cases, or even reversed in others? This difference, which
we shall call d, is only a sample one, which should be regarded
as falling at some point on the distribution of a
large number of possible alternative d's. Clearly the best
possible estimate of the true difference between men and
women would be a grand average of all these hypothetical
d's. This quantity (the mean of the distribution of d's)
we shall call D. The particular d which we know is a more
or less inaccurate approximation to D.

A formula has been worked out by means of which it is
possible to predict, from the actual data available, what
would be σ_D, that is the S.D. of this distribution of d's.
The formula will be given later. In the present instance
it yields σ_D = 1.41. As the distribution is a normal one,
this figure will enable us to define its limits. We can state
that about one-third of the d's must fall between D and D
+ 1.41, another third between D and D - 1.41, and that
about one-sixth of them must deviate more widely from D
in each direction. Scarcely any of them, or only 0.135%,
will be as large as D + 4.23 (i.e. 3σ_D), and only 0.135%
will be as small as D - 4.23. Hence 99.73% (i.e. 100 - 2 × 0.135)
of the d's lie within the limits D ± 4.23, and the odds against
any d lying outside these limits are 99.73 to 0.27, or 370
to 1. As a general rule, the probability is 370 to 1 that
any d which deviates from D purely on account of chance
errors of sampling will not do so by more than 3σ_D.

Certain other probabilities have been deduced from the
probability integral graph (Fig. 18), and are listed in the
following table:

Limits           Frequencies of d's falling    Odds against d's falling
                 outside these limits          outside these limits

± 4σ_D           2 × 0.0032%                   15,600 to 1
± 3σ_D           2 × 0.135%                    370 to 1
± 2σ_D           2 × 2.28%                     21 to 1
± 1σ_D           2 × 15.9%                     2 to 1
± 0.6745σ_D      2 × 25.0%                     1 to 1

Now these predictions must apply to the particular d
which we happen to know, whose value is 1.90. The odds
are very much against its deviating as much as 3σ_D from
D, but it might quite possibly deviate as much as, say, 2σ_D,
since the odds against this are much lower. We can therefore,
as it were, reverse the argument, and state with
practical certainty that D must lie between 1.90 ± 3σ_D,
i.e. between + 6.13 and - 2.33. We do not, of course,
know what is the value of D, but if it was greater than
+ 6.13, or less than - 2.33, our d would be more than
3σ_D removed from it, which is highly improbable. Similarly
we can state that the chances are 21 to 1 that D is
not greater than + 4.72, nor less than - 0.92 (i.e. 1.90
± 2 × 1.41).
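
The entries of the table, and the limits derived from them, can be checked numerically. What follows is a minimal sketch using only the Python standard library; the function names are illustrative, and the figures d = 1.90 and σ_D = 1.41 are those of Ex. 28. The computed odds for ± 4σ_D come out a little above 15,600 to 1, because the table works from the rounded frequency 0.0032%.

```python
import math

def fraction_outside(k):
    """Fraction of a normal distribution lying outside the limits +/- k sigma."""
    return math.erfc(k / math.sqrt(2))

def odds_against_outside(k):
    """Odds (to 1) against a value falling outside +/- k sigma."""
    p = fraction_outside(k)
    return (1 - p) / p

def limits_for_true_value(d, se, k):
    """Limits d - k*se and d + k*se within which the true value D is expected."""
    return d - k * se, d + k * se

for k in (4, 3, 2, 1, 0.6745):
    print(f"+/- {k} sigma: {100 * fraction_outside(k):.4f}% outside, "
          f"odds about {odds_against_outside(k):,.1f} to 1")

d, se_d = 1.90, 1.41  # sample difference and its S.E., from Ex. 28
print([round(x, 2) for x in limits_for_true_value(d, se_d, 3)])  # 3-sigma limits
print([round(x, 2) for x in limits_for_true_value(d, se_d, 2)])  # 2-sigma limits
```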

The Standard Error as an Index of the Trustworthiness of
a Result. It may be seen then that σ_D provides us with
an index of the margin of uncertainty in our d. It tells us
that the difference of 1.90 in favour of men students is by
no means trustworthy, since the true value of D may be
considerably larger or smaller. There are 2.28 chances in
a hundred that it may be as high as + 4.72, and 2.28
chances that women may actually be better than men
by 0.92.

This is the manner in which the scientific investigator
always thinks of the results of psychological or educational
experiments. His obtained result, being only a sample
based on a limited number of cases, is likely to deviate
more or less from the true result which he would obtain if
he could experiment on a very much larger set of people.
He can, however, work out the σ_D, often referred to as the
Standard Error or S.E. of his result, which tells him, not
precisely how much his result is in error, but what is the
probability that his result is in error by various amounts.
There is a 2 to 1 chance that the error may be no bigger
than the S.E., and the odds against the occurrence of a
larger error rise very rapidly, so that it is extremely unlikely
to be greater than three times the S.E.

The Probable Error. The figures at the foot of the table,
above, show that there is an even chance of D lying within,
or lying outside, the limits ± 0.6745σ_D. This particular
fraction of the S.E. is known as the Probable Error or P.E.
The name is not a very good one, but in a sense it is the
most probable error, since the odds against either larger
or smaller errors than this are higher than 1 : 1. In our
example, the P.E. = 0.6745 × 1.41 = 0.95. This means
that D is just as likely to lie between 1.90 + 0.95 and
1.90 - 0.95 as it is to lie outside these limits.
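
The arithmetic of the probable error is simple enough to set out as a short sketch. The function name below is an illustrative assumption; the figures are those of Ex. 28.

```python
def probable_error(se):
    """Probable error: the fraction 0.6745 of the standard error."""
    return 0.6745 * se

d, se_d = 1.90, 1.41  # sample difference and its S.E., from Ex. 28
pe = probable_error(se_d)

# Limits within which the true difference D has an even chance of lying.
even_chance_limits = (d - pe, d + pe)
print(round(pe, 2), [round(x, 2) for x in even_chance_limits])
```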

The P.E. is frequently used as an index of the unreliability
or margin of uncertainty, in preference to the S.E.
Instead of saying that the odds against D deviating from d
by 3 × S.E. are 370 to 1, we can say that a deviation of
4½ × P.E. possesses this degree of unlikelihood, since
4½ × 0.6745 × S.E. is approximately equal to 3 × S.E.
Or if ± 2 × S.E. are regarded as the limits within which
D may reasonably be expected to lie, we can alternatively
denote these limits as ± 3 × P.E. Often a numerical
result such as our d of + 1.90 is stated in the form:
1.90 ± 0.95, where the second figure is the P.E. of the first.

While both the S.E. and the P.E. indicate the likely
amount of error in some numerical result, the S.E. is, as
we have seen, primarily a measure of the dispersion of all
the possible sample results. The same is true of the P.E.
In the hypothetical distribution of sample d's, the limits
D + P.E. to D - P.E. enclose 50% of these d's, and 25%
of them are greater than D + P.E., 25% less than D -
P.E. Note therefore that the P.E. is identical with Q,
the semi-interquartile range, of this distribution. For the
limits Q₁ to Q₃ also mark off the middle 50% of measures
in a normal distribution.

The Statistical Significance of a Difference. Many
experiments have shown that the scores of comparable groups
of men and women on intelligence tests are practically
identical, or that any differences found are not reliable.
Since, in our example, the odds are only 21 to 1 against D
being greater than 4.72 or less than - 0.92, it seems quite
possible that here too the d is not reliable, and that the
true value of D is zero. If this be so, then it is easy to
determine from the probability integral graph the chances
that a sample might show a d of + 1.90, merely through
errors of sampling. 1.90 is 1.35 × 1.41, or 1.35σ_D.
Consulting the graph (Fig. 18), we see that 9% of measures in
a normal distribution lie at or beyond 1.35σ from the mean.
Hence when D is zero, 9% of the sample d's amount to
+ 1.90 or more, 41% lie between 0 and + 1.90, and 50%
are below zero. Thus the odds against a d of 1.90 being
due to errors of sampling are only 91 to 9, or about 10 to 1.
In other words, it might occur as a matter of chance once
in ten testings of similar samples of students. Such odds
are too low for us to put any trust in the conclusion that
men students are better than women at this test. Had d
been + 4.23, or even + 2.82 (3 or 2 × σ_D), we might have
felt more certain, since the odds against such deviations
from a D of zero would have been 99.865 to 0.135, or 740
to 1, and 97.72 to 2.28, or 43 to 1, respectively.
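
The reading taken from the probability integral graph can be reproduced by calculation. The sketch below assumes, as the text does, that the true difference D is zero and that the sample d's are normally distributed with σ_D = 1.41; the function names are illustrative.

```python
import math

def upper_tail(z):
    """Fraction of a normal distribution lying at or beyond +z sigma."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def odds_against_chance(z):
    """Odds (to 1) against a deviation of +z sigma or more arising by chance."""
    p = upper_tail(z)
    return (1 - p) / p

critical_ratio = 1.90 / 1.41  # d divided by its S.E., from Ex. 28
print(round(critical_ratio, 2))
print(round(100 * upper_tail(critical_ratio), 1), "% of sample d's at or beyond d")
print(round(odds_against_chance(critical_ratio), 1), "to 1 against")

# The two comparison cases in the text: d equal to 3 and to 2 times its S.E.
print(round(odds_against_chance(3)), "to 1;", round(odds_against_chance(2)), "to 1")
```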

The difference between the average scores of two groups
divided by its S.E. is often referred to as a critical ratio, or
indicated by the symbol t. In this example t is only 1.35.
It is customary to place very little reliance in a difference
when its t is less than 2, or preferably 3; in other words,
to regard such a d as lacking in statistical significance.
Alternatively, a d is regarded as reliable, and statistically
significant, if it exceeds 3, or preferably 4½, times its P.E.

We should note that, even if d had been large enough in
the above example (relative to its S.E. or P.E.) to be
accepted as significant, the conclusion that men are somewhat
superior at the test would have applied only to the
type of students tested, namely Scottish university
graduates who had chosen the teaching career. It could
not be extended to English training-college students, nor
to other university students, far less to men and women in
general, since the persons actually tested could not be
regarded as typical or representative of these wider groups.
Again it would be safer not to claim that such a difference
was a difference in intelligence, but only in the ability
which is measured by this particular group intelligence test.
The statistical treatment which has been applied to the
distributions of scores only informs us how sound and
reliable are the results in respect of other similar testees,
i.e. it takes care of sampling errors, but it does not eliminate
errors or bias in selection. As mentioned in Chap. I,
statistical methods cannot improve data which are in any
way inaccurate or defective; they merely bring out the
significance (or lack of significance) of the data actually
obtained.

FORMULÆ FOR THE STANDARD AND PROBABLE ERRORS OF
NUMERICAL RESULTS


Common sense tells us that a numerical result such as
an average, or a correlation, or a difference between two
averages, will always be more reliable the greater the
number of cases on which it is based. In Ex. 26 (p. 89)
it was shown that the averages obtained by sets of twelve
students on a test of reading ability were exceedingly
variable. Obviously twelve is far too small a sample upon
which to base conclusions. In order to make a trustworthy
comparison between the performance of Scottish
and American students at this test, we should need much
larger numbers. Hence most of the formulæ for reliability
involve the quantity 1/√N, which means that by quadrupling
the size of a sample, we can halve the variability of any
result computed from this sample, or double its reliability.
Another important factor which governs the reliability of
a result is the variability among the original measures.
Suppose that, in Ex. 28, all the men students had scored
101 or 102 (average 101.55), and all the women had scored
99 or 100 (average 99.65), we should certainly have been
entitled to claim that men students do better at the test
than women. But it is because the men's scores range all
the way from 73 to 134, and the women from 78 to 129,
that we felt suspicious of the difference in averages, and
applied statistical treatment to it. There was so much
overlapping that about 42% of the women obtained higher
scores than the average man, and about 42% of the men
scored lower than the average woman. Hence the reliability
formulæ also involve some measure of the dispersion
or extent of variation of the original measures.

The S.E. of a Difference. This S.E. is given by the
formula:

$$\sigma_D = \sqrt{\frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}}$$

where σ₁ and σ₂ are the S.D.s of the scores in the two groups,
and N₁ and N₂ are their numbers. The P.E. of a difference is
0.6745 times this quantity.

Ex. 29. As a fresh illustration, let us compare the scores on
the Nelson-Denny reading test of students having Honours
degrees in Arts with the scores of students having Honours in
Science or ordinary degrees. The essential data are as follows:

                            N        M         σ
Honours Arts students      66     129.47     16.42
Other students            198     109.97     18.90

The difference between the two averages is 19.50.

$$\sigma_D = \sqrt{\frac{16.42^2}{66} + \frac{18.90^2}{198}} = \sqrt{4.085 + 1.804} = 2.427$$

Here the critical ratio is 19.5/2.427, or approximately 8. Such a
difference could not possibly occur through chance errors of
sampling. Hence we may conclude that Honours Arts students,
of the type tested, are definitely superior to other graduate
students at this test.
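
The formula for the S.E. of a difference lends itself to a direct check. The following is a sketch only, with an illustrative function name. It reproduces the figure of Ex. 29 and also the σ_D of 1.41 quoted in Ex. 28, taking for the latter the S.D.s 12.36 and 9.95 given later in Ex. 32 and groups of 136 men and 114 women.

```python
import math

def se_difference(sd1, n1, sd2, n2):
    """S.E. of the difference between the means of two independent groups."""
    return math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

# Ex. 29: Honours Arts students against other students.
se_29 = se_difference(16.42, 66, 18.90, 198)
difference = 129.47 - 109.97
print(round(se_29, 3), round(difference / se_29, 1))  # S.E. and critical ratio

# Ex. 28: men against women students on the intelligence test.
se_28 = se_difference(12.36, 136, 9.95, 114)
print(round(se_28, 2))
```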


S.E. of a Difference between Averages of Scores which are
Correlated. When one group of persons takes two
tests, or if they repeat a test, and we wish to examine the
statistical significance of the difference between their two
average scores, we must use a modified formula. This
applies whenever there exists a correlation, r, between the
two sets of scores.¹

$$\sigma_D = \sqrt{\frac{1}{N}\left(\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2\right)}$$

Ex. 30. Marks in English and arithmetic were awarded to
a class of 57 children. These are tabulated in Ex. 22, p. 67.
Apparently the arithmetic marks run higher than the English
marks. Is the difference significant? The figures are as follows:

               Mean      Difference      σ        N
English       67.615                   12.94
Arithmetic    70.510       2.895       14.64      57

The correlation between the two sets of marks is indicated by
a coefficient of + 0.542.

$$\sigma_D = \sqrt{\frac{1}{57}\left(12.94^2 + 14.64^2 - 2 \times 0.542 \times 12.94 \times 14.64\right)} = 1.761$$

The difference of 2.895 is only 1.64 times its S.E., hence it is
not significant.

¹ An alternative version of the same formula is:
$\sigma_D = \sqrt{\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2}$,
where σ₁ and σ₂ are not the S.D.s of the original measures, but
the S.E.s of the means of these measures (cf. the following
paragraph).
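
A short sketch of the modified formula is given below. The function name is illustrative. The S.D.s used for Ex. 30 (12.94 and 14.64) are readings of a damaged passage and should be treated as assumptions; with them the calculation gives about 1.76, agreeing with the printed 1.761 to within rounding.

```python
import math

def se_difference_correlated(sd1, sd2, r, n):
    """S.E. of the difference between two means from the same n persons,
    where r is the correlation between the two sets of scores."""
    return math.sqrt((sd1 ** 2 + sd2 ** 2 - 2 * r * sd1 * sd2) / n)

# Ex. 30: English and arithmetic marks of 57 children (S.D.s assumed as above).
se_30 = se_difference_correlated(12.94, 14.64, 0.542, 57)
print(round(se_30, 2), round(2.895 / se_30, 2))  # S.E. and critical ratio

# With r = 0 the third term vanishes and the S.E. is larger.
print(round(se_difference_correlated(12.94, 14.64, 0.0, 57), 2))
```

The second line of output shows why the correlation matters: ignoring a positive r overstates the S.E. and so understates the significance of the difference.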


S.E. of a Mean. An average is, as we have seen in Ex.
26, quite as likely to vary through errors of sampling as a
difference between averages. Hence it is very useful to
know how near to the true average of a very large group
is likely to be the average obtained from a group of limited
size. The S.E. of the mean is often symbolized by σ_M,
and is given by the formula:

$$\sigma_M = \frac{\sigma}{\sqrt{N}}$$

Here σ is, of course, the S.D. of the original scores, whose
average has been computed.

Ex. 31. The twelve scores on the Nelson-Denny test, listed
in Ex. 26, have a mean of 113.58 and S.D. of 20.07.

$$\sigma_M = \frac{20.07}{\sqrt{12}} = 5.794$$

The P.E. of M is 0.6745 × S.E. = 3.908.
Thus the true mean for students of the type tested may just
as well lie outside the limits 113.58 ± 3.91 (i.e. 117.49 to 109.67)
as within them. It may be as much as 3σ_M away from 113.58,
i.e. as high as 130.96 or as low as 96.20, but is unlikely to exceed
these limits. Obviously the figure is extremely unreliable.
By contrast the mean for 264 students was 114.9 and the S.D.
20.15. Here

$$\sigma_M = \frac{20.15}{\sqrt{264}} = 1.24$$

The sample is 22 times as large, hence σ_M is about 1/√22 of its
previous value. The P.E. = 0.836.
From these figures it follows that there is an even chance of the
true mean lying between 115.7 and 114.1, and that it almost
certainly does not lie outside the limits 118.6 and 111.2.
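
The two calculations of Ex. 31 can be sketched as follows. The function name is illustrative, and the S.D.s are taken as quoted in the text rather than recomputed from the scores. The last lines illustrate the earlier remark that quadrupling the size of a sample halves the variability of the result.

```python
import math

def se_mean(sd, n):
    """S.E. of a mean: the S.D. of the scores divided by the square root of N."""
    return sd / math.sqrt(n)

se_12 = se_mean(20.07, 12)     # the twelve students of Ex. 26
se_264 = se_mean(20.15, 264)   # all 264 students
print(round(se_12, 3), round(0.6745 * se_12, 3))    # S.E. and P.E. for N = 12
print(round(se_264, 2), round(0.6745 * se_264, 3))  # S.E. and P.E. for N = 264

# Quadrupling N (12 to 48) halves the S.E. when the S.D. is unchanged.
print(round(se_mean(20.07, 48) / se_12, 3))
```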


S.E. of a Standard Deviation and of a Difference between
S.D.s. Not only the mean but also the S.D. of a set of
measures may be affected by errors of sampling. The
formula for the S.E. of a S.D. is:

$$\frac{\sigma}{\sqrt{2N}}$$

The S.E. of a difference between two S.D.s is given by:

$$\sqrt{\frac{\sigma_1^2}{2N_1} + \frac{\sigma_2^2}{2N_2}}$$

Ex. 32. The S.D.s of the intelligence test scores of men
and women students were 12.36 and 9.95 respectively, indicating
that the dispersion of ability was greater among the men.
The S.E. of the difference of 2.41 between these two figures is:

$$\sqrt{\frac{12.36^2}{2 \times 136} + \frac{9.95^2}{2 \times 114}} = 0.998$$

The difference is 2.42 times its S.E., and this shows a fair degree
of statistical significance. The odds against such a difference
occurring through chance errors are 99.23 to 0.77, or 129 to 1.
The result therefore allows a reasonably strong presumption
that men students of the type tested are in general more spread
out in their ability at an intelligence test than are women
students.
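
As a check on Ex. 32, the formula may be sketched as below. The function name is illustrative, and the group sizes of 136 men and 114 women are the ones consistent with the S.E. of 0.998 and with the σ_D of 1.41 in Ex. 28.

```python
import math

def se_sd_difference(sd1, n1, sd2, n2):
    """S.E. of the difference between the S.D.s of two independent groups."""
    return math.sqrt(sd1 ** 2 / (2 * n1) + sd2 ** 2 / (2 * n2))

se_32 = se_sd_difference(12.36, 136, 9.95, 114)
ratio = (12.36 - 9.95) / se_32
print(round(se_32, 3))  # S.E. of the difference between the two S.D.s
print(ratio)            # the difference of 2.41 expressed in S.E. units
```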


S.E. of a Percentage and of a Difference between
Percentages. Often it is not possible to measure the amounts
of some quality in a group of persons, but only to state
what percentage of the persons possess or do not possess it.
Suppose that in several schools 60% of the boys and 65%
of the girls regularly take milk, does this indicate a reliable
sex difference? Again the trustworthiness of the result
depends on the number of cases. If p is the percentage,
and q = 100 - p, then the S.E. of p is $\sqrt{pq/N}$, and the S.E.
of a difference between two percentages is:

$$\sqrt{\frac{p_1 q_1}{N_1} + \frac{p_2 q_2}{N_2}}$$


Ex. 33. From how many children would it be necessary to
obtain milk-drinking records in order to be practically certain
that the difference of 65% - 60% is reliable?

This example is posed in the reverse form from Exs. 29-32,
but the same method is employed. If a difference of 5% is
reliable, it must be three times its S.E. That is, the S.E. must
not be greater than 5/3 = 1.667. Let the number of both boys
and girls in the schools where records are obtained be N.
Then:

$$1.667 = \sqrt{\frac{60 \times 40 + 65 \times 35}{N}}, \qquad N = 1{,}683$$

The answer is, then, that the sex difference of 5% would be
significant if based on some 1,600 odd boys and an equal
number of girls, but that it would not be so if the numbers were
much smaller.
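
The reverse calculation of Ex. 33 amounts to solving the S.E. formula for N. The sketch below assumes, as the example does, equal numbers of boys and girls and a required ratio of three times the S.E.; the function names are illustrative.

```python
import math

def se_percentage_difference(p1, n1, p2, n2):
    """S.E. of the difference between two percentages (p given in per cent)."""
    return math.sqrt(p1 * (100 - p1) / n1 + p2 * (100 - p2) / n2)

def required_n(p1, p2, k=3.0):
    """Equal group size N at which the difference p1 - p2 is k times its S.E."""
    pq_sum = p1 * (100 - p1) + p2 * (100 - p2)
    return pq_sum * k ** 2 / (p1 - p2) ** 2

n = required_n(65, 60)
print(round(n))                                            # children of each sex
print(round(se_percentage_difference(65, n, 60, n), 3))    # S.E. at that N
```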

One of the commonest faults in writings about educational
topics is the quotation of percentage results without
any indication of the reliability of these results, often without
even any mention of the actual numbers of cases from
which the P.E.s could be calculated.

It should be noted that all the formulæ quoted above
for the S.E. of a difference (between means, between S.D.s,
and between percentages) are derived from one and the
same formula:

$$\sigma_D = \sqrt{\sigma_1^2 + \sigma_2^2 - 2r\sigma_1\sigma_2}$$

where σ₁ and σ₂ are the S.E.s of the means, S.D.s, and
percentages, respectively (not the S.D.s of the original
scores). The third term, 2rσ₁σ₂, applies whenever there is
any correlation between the original measures from which
the means, S.D.s, or percentages are derived. But when,
as in most of our examples above, there is no such correlation,
the term vanishes.

Statistical Treatment of Small Samples. It has been
shown by Fisher (1930) and others that numerical results
obtained from very small samples are even less reliable
than the statistical treatment described above would lead
us to expect. Under such circumstances it is better to
calculate a S.E. by substituting N - 1 for N in the
denominator. Special tables or graphs are available, which
show the probability of various values of t.¹ However, this
modified treatment makes very little difference until N
falls below about 15. And as the enormous majority of
educational experiments are carried out on much larger
numbers than this, we will not give any further details here.

¹ Cf. Fisher (1930), Dawson (1933).

S.E. and P.E. of a Correlation Coefficient.¹ The correlation
between two sets of measures is likely to be very
unreliable if the number of persons is small, as was shown in
Ex. 27 (p. 90). The S.E. or P.E. of a correlation coefficient
tells us how widely this coefficient may be expected to
deviate from the true value, i.e. the value which would be
obtained from an unlimited number of persons. The
formulæ for these errors will be given in Chap. VI.

S.E. and P.E. of an Individual Test Score or Examination
Mark. If some pupils or students take a test or examination
two or more times, or answer two or more supposedly
parallel tests in the same subject, they will not obtain
precisely identical scores on each occasion. Hence any
score or mark derived from only a single test is to some
extent unreliable. It may deviate more or less from the
score which would be obtained with far more thorough
testing. In other words, the single test only gives us a
limited sample of the pupils' or students' capabilities.
Here, too, then the conception of the S.E. has valuable
applications. But the reliability of a score depends, not
on the number of persons who take the test, nor on the
wideness of dispersion of scores, but on the thoroughness
of the test itself, or the number of items of which it is
composed, and on the amount of variability in pupils' answers
to these items. The methods of calculation are described
in Chap. VII.

Finally, when a test or examination is employed for
predicting a pupil's success or lack of success during his
subsequent educational or vocational career, it is obvious
that the estimate obtained from the test score will deviate
more or less from the truth. The reliability of such
predictions naturally depends mainly upon the adequacy or
validity of the test. It, too, can be expressed in terms of
a S.E. or P.E., and the appropriate methods will be given
below.

¹ It seems to be the convention in psychological literature to
employ the S.E. in the statistical treatment of averages and the like,
and the P.E. in the treatment of correlations and individual scores.
There is no logical reason for the convention, but we shall usually
adhere to it in this book.

GOODNESS OF FIT

A somewhat different type of problem, to which the
theory of sampling also applies, is known as testing for
goodness of fit. One instance which frequently occurs in
educational psychology is the investigation of whether or
not a distribution may be regarded as normal. We have
seen that distributions of small numbers of measures may
depart considerably from the smooth normal shape, and
that by increasing N the irregularities tend to become
ironed out. Here also, therefore, the result obtained from
a limited sample (the result being an irregular distribution)
can deviate considerably from the result which a much
larger sample would yield (a regular distribution), purely
through errors of sampling. The test for goodness of fit
tells us whether such a departure from normality is a chance
one, or whether it is so great that the distribution cannot
be accepted as an approximation to the normal shape.


Ex. 34. The following results were obtained in Ex. 19
(p. 60).

    F           F_e        F - F_e     (F - F_e)² / F_e

    1 }         1 }
    1 }  8      1 }  6       + 2           0.6667
    6 }         4 }
   11           9            + 2           0.4444
   22          19            + 3           0.4737
   26          30            - 4           0.5333
   35          41            - 6           0.8780
   47          44            + 3           0.2045
   43          40            + 3           0.2250
   27          28            - 1           0.0356
   18          18              0           0.0000
   10           9            + 1           0.1111
    3 }  3      4 }  6       - 3           1.5000
    0 }         2 }

  250         250                          5.0723

They show in the F column the actual frequencies of scores
obtained by students on the Cattell test, and in the F_e column
the frequencies which would be expected if the distribution of
scores was a truly normal one.

The method is to tabulate the differences between each actual
and expected frequency. The smallest frequencies, at the head
and foot of the column, are usually grouped together, as shown
here. The differences are squared and divided by their
respective F_e's. The sum of these quantities is called chi squared
(χ²). Tables are available showing the probability of any value
of chi squared for any number of classes. In this instance the
probability is 0.75.¹ That is, we might expect as great, or
greater, deviations of the distribution from normality in three
cases out of four by pure chance. The approximation to
normality is quite satisfactory.

The same method can be used to determine whether
or not a set of frequencies conforms to any other expected
distribution.
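
The summation described in Ex. 34 may be sketched as follows. The function name is illustrative, and the frequencies are entered with the extreme classes already grouped (8 against 6 at the head of the column, 3 against 6 at the foot), which is an assumption about how the damaged table should be read. The result differs from the printed total in the fourth decimal place only because the printed terms were rounded before being added.

```python
def chi_squared(observed, expected):
    """Sum of (F - F_e)^2 / F_e over all classes."""
    return sum((f - fe) ** 2 / fe for f, fe in zip(observed, expected))

# Grouped frequencies of Ex. 34: actual (F) and expected (F_e).
observed = [8, 11, 22, 26, 35, 47, 43, 27, 18, 10, 3]
expected = [6, 9, 19, 30, 41, 44, 40, 28, 18, 9, 6]

print(sum(observed), sum(expected))           # both totals should be 250
print(round(chi_squared(observed, expected), 4))
```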


Ex. 35. At the time of a Scottish university Rectorial election,
32 women students informed the present writer as to
their votes for the four candidates. These votes are listed in
the F column below. Now the voting for the candidates among
the student body as a whole was 38%, 26%, 20%, and 16%
respectively. If, then, the 32 students were a representative
sample of the student body, i.e. if they had cast their votes
in the same proportions as the rest, the figures would be
approximately as shown in the F_e column.

Candidate        F       F_e      F - F_e     (F - F_e)² / F_e

Pacifist        18      12         + 6            3.000
Nationalist      4       8½        - 4½           2.382
Unionist         5       6½        - 1½           0.346
Socialist        5       5           0            0.000

                32      32                        5.728

Clearly the F and F_e columns are decidedly different, but before
we can conclude that these 32 were not a typical sample of the
student body, we must examine the probability of such divergences
occurring through mere errors of sampling. The procedure is the
same as before. The value of chi squared which is obtained, 5.728,
is found from the tables to possess a probability of 0.125. That
is, it might occur by chance 1 in 8 times. Hence the difference
between the two distributions is not reliable.

Suppose, however, that there had been a hundred students
instead of 32, voting in the same proportions as these 32. Chi
squared would then have been 3⅛ times as big, and its probability
would have sunk to approximately 0.001. From this
larger sample we could certainly have deduced that the women
students were more pacifistic, less nationalistic, than the other
students, but such a deduction from the small sample would
have been very untrustworthy.

¹ We have not the space here to explain the use of chi squared
tables. Those who have need of them should study the concept
of 'degrees of freedom' in Dawson's, Fisher's, or other textbooks.
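
The figures of Ex. 35 can be reproduced without tables. The sketch below is illustrative only. It assumes three degrees of freedom (four classes, less one for the fixed total), which the text does not state explicitly, and uses the closed form of the chi squared probability for that case; the function names are not taken from the source.

```python
import math

def chi_squared(observed, expected):
    """Sum of (F - F_e)^2 / F_e over all classes."""
    return sum((f - fe) ** 2 / fe for f, fe in zip(observed, expected))

def probability_3_df(chi2):
    """Probability of a chi squared at least this large, for 3 degrees of
    freedom (closed form valid for this number of degrees of freedom only)."""
    return (math.erfc(math.sqrt(chi2 / 2))
            + math.sqrt(2 * chi2 / math.pi) * math.exp(-chi2 / 2))

observed = [18, 4, 5, 5]       # votes of the 32 women students
expected = [12, 8.5, 6.5, 5]   # votes expected from the whole student body

chi2_32 = chi_squared(observed, expected)
print(round(chi2_32, 2), round(probability_3_df(chi2_32), 3))

# A hundred students voting in the same proportions: chi squared scales by 100/32.
chi2_100 = chi2_32 * 100 / 32
print(round(chi2_100, 1), probability_3_df(chi2_100))
```

With the larger sample the probability falls below one in a thousand, in line with the approximate figure given in the text.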


CHAPTER VI 
MEASUREMENT OF CORRELATION 


A very wide range of problems in educational, and other
branches of, psychology depends on the study of relationships
between variables. For example, examinations are
set at the end of the primary school career, partly in order
to test the work that the children have done, but also partly
because achievement in English and arithmetic is supposed
to be related to, or predictive of, future achievement in
more advanced work. The main object of applying intelligence,
or other aptitude, tests is to show probable ability
in various fields which are related to the test performances.
In many secondary schools the pupils are allotted to different
classes for general work, for mathematics, and for
foreign languages, since it is believed (with much justification)
that abilities in these three types of study are not
very closely related, that some may be good in general
work, poor at mathematics, medium at languages, and
so on.

Several different methods, applicable to different kinds
of data, are available for determining the extent of relationship
between two variables. Almost all of them yield
indices ranging from 0 (no correlation) to + 1.0 (perfect
correlation). By far the most frequently used are the
product-moment correlation coefficient and the rank-order
correlation coefficient. We will point out later the limitations
of these two methods, and mention briefly some of
the alternatives.

Ex. 36.—The following marks were obtained by a class of 36 children
on examinations in history and geography. A glance will show that
there is some correspondence; that those who obtain high marks in one
tend to get high marks in the other, and that low marks in one tend to
go with low marks in the other; but that there are many exceptions to
this trend.

Name     History  Geography   History  Geography   History  Geography
           X1        X2         X1        X2         X1        X2
Alan       16        12         10         5          8         6
Helen      19        15         14        18         20        12
John       17        11          8         6         17        11
           10        17          6         4         12         8
           18        14         18        20          7         2
           18        16         14         8         17        16
           15         4         12         4         14        11
           17        13         18        14         14         8
           12        12         16        10         12         3
           16         8          6         4         16         1
           11        10         18        14         20        15
           14         9         12        11         17         6

The state of affairs may be seen more clearly if the two sets of
measures are plotted against one another. Fig. 21 is called a
scatter-diagram or scattergram.

[Fig. 21.—Scattergram, comparing two sets of school marks. Vertical
axis: History (X1); horizontal axis: Geography (X2).]

The marks have both been grouped into classes of 2. History marks (X1)
are shown on the vertical axis, geography marks (X2) on the horizontal
axis. Each pupil's position in the two exams is shown by a tick. For
example, Alan's 16 and 12 are represented by a tick opposite the 15-16
mark in history, and below the 11-12 mark in geography. Four pupils
obtained either 17 or 18 in history and either 13 or 14 in geography,
hence there are four ticks in this box or cell.
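The grouping just described can be reproduced mechanically. What follows is only a minimal Python sketch, not part of the original text: it takes the 36 pairs of marks of Ex. 36 (history first, geography second), assigns each mark to its class of 2, and counts the ticks falling in each cell. The helper name `class_of` and the layout of the data are illustrative assumptions.

```python
from collections import Counter
from math import ceil

# (history X1, geography X2) marks of the 36 pupils of Ex. 36
marks = [
    (16, 12), (19, 15), (17, 11), (10, 17), (18, 14), (18, 16),
    (15, 4), (17, 13), (12, 12), (16, 8), (11, 10), (14, 9),
    (10, 5), (14, 18), (8, 6), (6, 4), (18, 20), (14, 8),
    (12, 4), (18, 14), (16, 10), (6, 4), (18, 14), (12, 11),
    (8, 6), (20, 12), (17, 11), (12, 8), (7, 2), (17, 16),
    (14, 11), (14, 8), (12, 3), (16, 1), (20, 15), (17, 6),
]


def class_of(mark, width=2):
    """Return the (low, high) limits of the class holding `mark`, e.g. 15 -> (15, 16)."""
    k = ceil(mark / width)
    return (width * (k - 1) + 1, width * k)


# One tick per pupil, filed under (history class, geography class).
cells = Counter((class_of(h), class_of(g)) for h, g in marks)

print(cells[((17, 18), (13, 14))])  # 4 ticks, as stated in the text
print(cells[((15, 16), (11, 12))])  # 1 tick: the pupil with 16 and 12
```

Summing the counts over a row or a column of `cells` gives the frequency distributions of the grouped history and geography marks respectively.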
It will be seen that the majority of ticks tend to be grouped along
the dotted diagonal line, although there are some rather extreme
exceptions, for instance, the pupil with marks of 16 and 1, and the
pupil with marks of 10 and 17. Obviously the closer the correspondence
between the two sets, the more closely would the ticks fall along the
diagonal, and the fewer would be the exceptions whose ticks are far
removed from it. Fig. 26 (p. 154) and Ex. 45 (p. 181) show the
scattergrams for a low correlation of +0.146 and a high correlation of
+0.799 (the numbers in each cell represent the numbers of ticks). Only
if each pupil got marks which exactly corresponded would the
correlation be perfect and the coefficient be +1.0. Note, however,
that it is not necessary for the marks on the two variables to be
identical for a perfect correlation to be obtained, since these marks
may be arranged on different scales. In Fig. 21 the geography marks
are on a different scale from the history marks; 10 on the one
corresponds, or is relatively equivalent, to 14 on the other. Consider
another instance: the weights and heights of a class of children would
certainly yield a very high correlation, although they would be
measured on entirely different scales, pounds and inches. It is the
closeness of agreement of relative scores or measures which governs
the size of the correlation.

Very few correlations in educational or psychological investigations
approximate to +1.0. Usually there is only a general trend for high
scores on one variable to go with high scores on the other, a trend
which admits of many exceptions. Two tests of the same ability, e.g.
two similar arithmetic papers, may perhaps give a coefficient of +0.90
or over. Tests or exams in different subjects, e.g. arithmetic and
English, or history and geography, are more likely to give moderate
correlation coefficients of +0.40 to +0.70.

Occasionally one may find negative or inverse correspondence, when
high scores on one variable go with low scores on another variable.
For instance, in an ordinary school class intelligence or scholastic
achievement and age are often negatively related, since the oldest
pupils are rather dull, the youngest pupils very bright. Negative
correlations are expressed, just as positive ones, by indices ranging
from 0 to −1.0.

Calculation of Correlations by the Product-moment Method (sometimes
called the Pearson-Bravais Method).—The fundamental formula for the
correlation coefficient, r, is:

$$r = \frac{\Sigma x_1 x_2}{N \sigma_1 \sigma_2}$$

Here $\sigma_1$ and $\sigma_2$ are the S.D.s of the two distributions.
$x_1$ represents a score or measure expressed as a deviation from the
mean of all the $X_1$ measures, $x_2$ is a measure expressed as a
deviation from the mean of all the $X_2$ measures. $x_1 x_2$ is the
product of a person's two measures, and $\Sigma x_1 x_2$ is the sum of
all these products.

Since the means are seldom likely to be whole numbers, the computation
of $\Sigma x_1 x_2$ would involve very troublesome arithmetic. Hence,
just as in calculating S.D.s, it is usual to resort to the device of
the arbitrary mean. Each of the quantities in the above formula then
requires a correction. If $x_1$ is a deviation from an arbitrary mean,
$\sigma_1$ is no longer

$$\sqrt{\frac{\Sigma x_1^2}{N}} \quad\text{but}\quad \sqrt{\frac{\Sigma x_1^2}{N} - \left(\frac{\Sigma x_1}{N}\right)^2}$$

(cf. p. 56). A corresponding alteration is made in $\sigma_2$.
$\frac{\Sigma x_1 x_2}{N}$ becomes
$\frac{\Sigma x_1 x_2}{N} - \frac{\Sigma x_1}{N}\cdot\frac{\Sigma x_2}{N}$.
Hence the full formula is:

$$r = \frac{\dfrac{\Sigma x_1 x_2}{N} - \dfrac{\Sigma x_1}{N}\cdot\dfrac{\Sigma x_2}{N}}{\sqrt{\dfrac{\Sigma x_1^2}{N} - \left(\dfrac{\Sigma x_1}{N}\right)^2}\;\sqrt{\dfrac{\Sigma x_2^2}{N} - \left(\dfrac{\Sigma x_2}{N}\right)^2}}$$

This same formula applies if the original scores are used, i.e. if
$X_1$ and $X_2$ are substituted for $x_1$ and $x_2$. The only
difference is that the arbitrary mean is now zero, instead of being
some whole number chosen close to the true mean.
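The statement that the full formula gives the same result whatever arbitrary mean is chosen can be checked directly. The following Python sketch is offered only as an illustration, not as the author's own procedure; the function name `product_moment_r` and its parameters `a1` and `a2` (the two arbitrary means) are assumptions introduced here. It is applied to the history and geography marks of Ex. 36.

```python
from math import sqrt


def product_moment_r(X1, X2, a1=0, a2=0):
    """Product-moment r by the full formula, using arbitrary means a1 and a2."""
    n = len(X1)
    x1 = [v - a1 for v in X1]          # deviations from the arbitrary means
    x2 = [v - a2 for v in X2]
    m1 = sum(x1) / n                   # sum(x1)/N, the correction term for X1
    m2 = sum(x2) / n
    numerator = sum(p * q for p, q in zip(x1, x2)) / n - m1 * m2
    sigma1 = sqrt(sum(p * p for p in x1) / n - m1 ** 2)
    sigma2 = sqrt(sum(q * q for q in x2) / n - m2 ** 2)
    return numerator / (sigma1 * sigma2)


history = [16, 19, 17, 10, 18, 18, 15, 17, 12, 16, 11, 14,
           10, 14, 8, 6, 18, 14, 12, 18, 16, 6, 18, 12,
           8, 20, 17, 12, 7, 17, 14, 14, 12, 16, 20, 17]
geography = [12, 15, 11, 17, 14, 16, 4, 13, 12, 8, 10, 9,
             5, 18, 6, 4, 20, 8, 4, 14, 10, 4, 14, 11,
             6, 12, 11, 8, 2, 16, 11, 8, 3, 1, 15, 6]

r_raw = product_moment_r(history, geography)            # arbitrary means of zero
r_arb = product_moment_r(history, geography, 14, 10)    # arbitrary means 14 and 10
print(round(r_raw, 3), round(r_arb, 3))  # 0.579 0.579
```

Working from whole-number deviations, as with the arbitrary means 14 and 10, matters for hand computation only; a machine reaches the same coefficient from the raw scores.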

Statistical textbooks often provide modifications of this formula, or
'patent' correlation techniques which are claimed to be simpler. The
present writer prefers the direct method given here, both because it
makes use of concepts such as the true and arbitrary mean, σ, and the
correction required when σ is calculated from an arbitrary mean
(concepts with which the student should already be familiar), and
because errors in computation are fairly readily detected. Moreover, a
coefficient accurate to three places of decimals can be obtained by
this method if the multiplication and division, and the finding of
squares and square roots, are done with a 10-inch slide-rule and/or
four-figure logarithm tables.


Ex. 37.—The formula will be applied first to the original table of
marks, given in Ex. 36, and later to the same marks grouped into
classes (as in Fig. 21). This first method should generally be
employed only when N is small, say less than 50, and when a high
degree of accuracy is needed. In the following table, columns X1 and
X2 list the original marks. As arbitrary means, 14 and 10 are
selected. Columns x1 and x2 list the deviations of X1 and X2 from
these means. They are summed to give $\Sigma x_1$ and $\Sigma x_2$.

$$\frac{\Sigma x_1}{N} = \frac{+5}{36} = +0.1389 \qquad \frac{\Sigma x_2}{N} = \frac{-2}{36} = -0.0555$$

Therefore the true means are 14.139 and 9.945. Later we shall need

$$\left(\frac{\Sigma x_1}{N}\right)^2, \quad \left(\frac{\Sigma x_2}{N}\right)^2, \quad\text{and}\quad \frac{\Sigma x_1}{N}\cdot\frac{\Sigma x_2}{N}.$$

These come to +0.019, +0.003, and −0.008 respectively. Note that three
places of decimals are sufficient for these corrections. Special
attention should be paid to the sign (+ or −) of
$\frac{\Sigma x_1}{N}\cdot\frac{\Sigma x_2}{N}$.

The next two columns, $x_1^2$ and $x_2^2$, give the deviations
squared. These are summed and averaged.

$$\frac{\Sigma x_1^2}{N} = \frac{549}{36} = 15.250 \qquad \frac{\Sigma x_2^2}{N} = \frac{836}{36} = 23.222$$

 X1   X2     x1    x2    x1²   x2²    x1x2 (+)  x1x2 (-)
 16   12     +2    +2      4     4        4
 19   15     +5    +5     25    25       25
 17   11     +3    +1      9     1        3
 10   17     -4    +7     16    49                 28
 18   14     +4    +4     16    16       16
 18   16     +4    +6     16    36       24
 15    4     +1    -6      1    36                  6
 17   13     +3    +3      9     9        9
 12   12     -2    +2      4     4                  4
 16    8     +2    -2      4     4                  4
 11   10     -3     0      9     0        0
 14    9      0    -1      0     1        0
 10    5     -4    -5     16    25       20
 14   18      0    +8      0    64        0
  8    6     -6    -4     36    16       24
  6    4     -8    -6     64    36       48
 18   20     +4   +10     16   100       40
 14    8      0    -2      0     4        0
 12    4     -2    -6      4    36       12
 18   14     +4    +4     16    16       16
 16   10     +2     0      4     0        0
  6    4     -8    -6     64    36       48
 18   14     +4    +4     16    16       16
 12   11     -2    +1      4     1                  2
  8    6     -6    -4     36    16       24
 20   12     +6    +2     36     4       12
 17   11     +3    +1      9     1        3
 12    8     -2    -2      4     4        4
  7    2     -7    -8     49    64       56
 17   16     +3    +6      9    36       18
 14   11      0    +1      0     1        0
 14    8      0    -2      0     4        0
 12    3     -2    -7      4    49       14
 16    1     +2    -9      4    81                 18
 20   15     +6    +5     36    25       30
 17    6     +3    -4      9    16                 12

 N = 36     +61   +72    549   836     +466       -74
            -56   -74
           = +5  = -2                  = +392

Hence

$$\sigma_1 = \sqrt{\frac{\Sigma x_1^2}{N} - \left(\frac{\Sigma x_1}{N}\right)^2} = \sqrt{15.250 - 0.019} = 3.903$$

$$\sigma_2 = \sqrt{23.222 - 0.003} = 4.819$$
The last pair of columns gives the products of the deviations; the +
and − products are separated merely for convenience in summing.
Special attention to the signs is needed. When $x_1$ and $x_2$ are
both + or both − the product is plus, when one is + and the other −
the product is minus. Notice how the exceptional cases, such as the
pupil with 10 and 17, make large negative contributions to
$\Sigma x_1 x_2$, and so ultimately reduce the size of r. The pupils
who get very high marks on both variables, or very low marks on both,
such as those with 6 and 4 or with 18 and 20, make large positive
contributions to $\Sigma x_1 x_2$ and help raise r.


$$\frac{\Sigma x_1 x_2}{N} = \frac{+392}{36} = 10.889$$

The correction, $\frac{\Sigma x_1}{N}\cdot\frac{\Sigma x_2}{N}$, is
subtracted from this, but as its sign was −, it is in this case added.

$$\frac{\Sigma x_1 x_2}{N} - \frac{\Sigma x_1}{N}\cdot\frac{\Sigma x_2}{N} = 10.889 - (-0.008) = 10.897$$

All the corrections indicated in the second formula have been made,
hence we can finally apply the first, or fundamental, correlation
formula:

$$r = \frac{10.897}{4.819 \times 3.903} = +0.579$$

Thus the coefficient shows moderate correspondence between the two
sets of marks, as we guessed from looking at the scattergram.

Product-moment Correlation between Tabulated Measures.—When the
measures are grouped into classes, the procedure is essentially
identical, except that the computations are, of course, performed for
all the measures in any one class at one time. We are no longer
concerned with the deviations of single measures from their mean, and
their products, but with deviations of classes from the mean class,
and their products. At no time is it necessary to translate the
results back into terms of the original measures, by multiplying by c
(the size of the class).

The two sets of measures are first plotted in a scattergram, which
should contain, if possible, between 11 and 19 rows and a similar
number of columns. There is no need to have identical numbers of rows
and columns. In Fig. 21 (p. 106) the data are, as a matter of fact,
rather too coarsely grouped, since there are only 8 rows and 10
columns. Over-coarse grouping has the effect of reducing slightly the
size of r (a formula for correcting this may be found in advanced
statistical textbooks). By adding up the frequencies in each row, and
in each column, we get the two frequency distributions of grouped
measures. Fig. 21 is reproduced below with the frequency distribution
of $x_1$ measures ($F_1$) written at the right-hand side of the rows,
and the frequency distribution of $x_2$ measures ($F_2$) written at
the foot of the columns. The frequencies in both these distributions
should be checked by making sure that they add up to N.


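          x2 =  -4  -3  -2  -1   0  +1  +2  +3  +4  +5     F1
x1 = +3          .   .   .   .   .   1   .   2   .   .      3
     +2          .   .   1   .   .   2   4   2   .   1     10
     +1          1   1   .   1   1   1   .   .   .   .      5
      0          .   .   .   2   1   1   .   .   1   .      5
     -1          .   2   .   1   1   2   .   .   .   .      6
     -2          .   .   1   .   .   .   .   .   1   .      2
     -3          1   .   2   .   .   .   .   .   .   .      3
     -4          .   2   .   .   .   .   .   .   .   .      2

     F2          2   5   4   4   3   7   4   4   2   1     36 = N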


An appropriate arbitrary mean is now chosen for the $x_1$ classes and
for the $x_2$ classes, and the classes above and below these means are
numbered +1, +2, ... −1, −2, ... etc. These numberings are written at
the left-hand side of the rows, and above the tops of the columns.
Standard deviations are now calculated in the usual way.


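 x1 and x2    F1    F2    F1x1    F2x2    F1x1²   F2x2²
    +5               1               5              25
    +4               2               8              32
    +3         3     4       9      12      27      36
    +2        10     4      20       8      40      16
    +1         5     7       5       7       5       7
     0         5     3     +34     +40       0       0
    -1         6     4      -6      -4       6       4
    -2         2     4      -4      -8       8      16
    -3         3     5      -9     -15      27      45
    -4         2     2      -8      -8      32      32
              36    36     -27     -35     145     213

              ΣF1x1 = +34 - 27 = +7     ΣF2x2 = +40 - 35 = +5

$$\frac{\Sigma F_1 x_1}{N} = \frac{+7}{36} = +0.1944 \qquad \left(\frac{\Sigma F_1 x_1}{N}\right)^2 = 0.0378$$

$$\frac{\Sigma F_2 x_2}{N} = \frac{+5}{36} = +0.1389 \qquad \left(\frac{\Sigma F_2 x_2}{N}\right)^2 = 0.0193$$

The correction $\frac{\Sigma F_1 x_1}{N} \times \frac{\Sigma F_2 x_2}{N}$,
which will be needed later in the numerator of the correlation
formula, is:

+0.1944 × +0.1389 = +0.027

$$\frac{\Sigma F_1 x_1^2}{N} = \frac{145}{36} = 4.028 \qquad \sigma_1 = \sqrt{4.028 - 0.038} = 1.997$$

$$\frac{\Sigma F_2 x_2^2}{N} = \frac{213}{36} = 5.917 \qquad \sigma_2 = \sqrt{5.917 - 0.019} = 2.429$$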

Note that these S.D.s are approximately one-half those obtained in the
previous computation of r (namely 3.903 and 4.819), the reason being
that we have grouped the original measures into classes of 2. They are
not exactly half because the grouping has produced some slight
distortion of the distribution.

Now to obtain the products, the frequency in each cell must be
multiplied by its $x_1$ value and by its $x_2$ value. For instance,
the entry 2 in cell $x_1 = +3$, $x_2 = +3$ contributes
2 × 3 × 3 = +18 to $\Sigma F x_1 x_2$. The entry 1 in cell
$x_1 = -2$, $x_2 = +4$ contributes 1 × 4 × −2 = −8.
 x1x2   Cells corresponding to     Frequencies      Total F   Fx1x2
        this value of x1x2         in these cells
   0    x1 = +1, x2 =  0                1
        x1 =  0, x2 =  0                1
        x1 = -1, x2 =  0                1
        x1 =  0, x2 = -1                2
        x1 =  0, x2 = +1                1
        x1 =  0, x2 = +4                1               7        0
  +1    x1 = +1, x2 = +1                1
        x1 = -1, x2 = -1                1               2       +2
  +2    x1 = +2, x2 = +1                2               2       +4
  +3    x1 = +3, x2 = +1                1
        x1 = -1, x2 = -3                2               3       +9
  +4    x1 = +2, x2 = +2                4
        x1 = -2, x2 = -2                1               5      +20
  +6    x1 = +2, x2 = +3                2
        x1 = -3, x2 = -2                2               4      +24
  +9    x1 = +3, x2 = +3                2               2      +18
 +10    x1 = +2, x2 = +5                1               1      +10
 +12    x1 = -4, x2 = -3                2
        x1 = -3, x2 = -4                1               3      +36
                                                              +123
  -1    x1 = +1, x2 = -1                1
        x1 = -1, x2 = +1                2               3       -3
  -3    x1 = +1, x2 = -3                1               1       -3
  -4    x1 = +2, x2 = -2                1
        x1 = +1, x2 = -4                1               2       -8
  -8    x1 = -2, x2 = +4                1               1       -8
                                                               -22

                                               ΣFx1x2 = +101


It is unnecessary, however, to consider each cell in turn. Instead we
take each possible value of $x_1 x_2$ in turn, as shown in the
accompanying table. All the cells for which either $x_1 = 0$ or
$x_2 = 0$ contribute zero to $\Sigma F x_1 x_2$. It will be seen that
the total frequencies in such cells is 7. The frequency in the cells
where $x_1 x_2 = +1$ is 2, hence +2 is entered in the last column.

When the student becomes familiar with the method, it will not be
found necessary to write out the second, third, and fourth columns of
such a table. Only the first, and the last two are needed. The
simplest plan is to draw a line through each of the cells for which
$x_1 x_2 = 0$, and to count up the frequencies in these cells while
doing so, and then enter the total in a table thus:

 x1x2    F    Fx1x2
   0     7      0
  +1     2     +2

Next cross out the cells where $x_1 x_2 = +1$, and enter in the table,
and so on. At the end, sum the F column, in order to make sure that
none of the entries in the cells have been omitted. The last column is
summed to give $\Sigma F x_1 x_2$, and the rest of the working is as
before.

$$\frac{\Sigma F x_1 x_2}{N} = \frac{+101}{36} = 2.806$$

The correction, to be subtracted, is +0.027.

$$2.806 - 0.027 = 2.779 \qquad r = \frac{2.779}{2.429 \times 1.997} = +0.573$$

It will be seen that in spite of the distortion due to grouping into a
rather coarse scattergram, the result is nearly identical with that
obtained from the ungrouped measures, namely +0.579.

There are, of course, numerous possibilities of error in calculating
product-moment correlations. All additions should be checked up and
down; multiplying, dividing, squaring, etc., may be done with
logarithm tables and confirmed with a slide-rule. But a little
practice will enable the student to predict r fairly closely from
consideration of the scattergram, and so to judge whether or not the
result he reaches is likely to be correct.
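The whole of the grouped computation can be repeated in a few lines. The sketch below, in Python, is illustrative only: the dictionary `cells` holds the cell frequencies of the scattergram as tallied from the marks of Ex. 36, keyed by the class deviations (x1, x2), and the variable names are assumptions introduced for the purpose.

```python
from math import sqrt

# Cell frequencies of the scattergram, keyed by the class deviations (x1, x2).
cells = {
    (3, 1): 1, (3, 3): 2,
    (2, -2): 1, (2, 1): 2, (2, 2): 4, (2, 3): 2, (2, 5): 1,
    (1, -4): 1, (1, -3): 1, (1, -1): 1, (1, 0): 1, (1, 1): 1,
    (0, -1): 2, (0, 0): 1, (0, 1): 1, (0, 4): 1,
    (-1, -3): 2, (-1, -1): 1, (-1, 0): 1, (-1, 1): 2,
    (-2, -2): 1, (-2, 4): 1,
    (-3, -4): 1, (-3, -2): 2,
    (-4, -3): 2,
}

N = sum(cells.values())
sum_x1 = sum(f * a for (a, b), f in cells.items())        # sum of F1*x1
sum_x2 = sum(f * b for (a, b), f in cells.items())        # sum of F2*x2
sum_x1_sq = sum(f * a * a for (a, b), f in cells.items())
sum_x2_sq = sum(f * b * b for (a, b), f in cells.items())
sum_prod = sum(f * a * b for (a, b), f in cells.items())  # sum of F*x1*x2

sigma1 = sqrt(sum_x1_sq / N - (sum_x1 / N) ** 2)
sigma2 = sqrt(sum_x2_sq / N - (sum_x2 / N) ** 2)
r = (sum_prod / N - (sum_x1 / N) * (sum_x2 / N)) / (sigma1 * sigma2)

print(N, sum_x1, sum_x2, sum_x1_sq, sum_x2_sq, sum_prod)  # 36 7 5 145 213 101
print(round(sigma1, 3), round(sigma2, 3), round(r, 3))    # 1.997 2.428 0.573
```

The small difference between the machine value of the second S.D. and the 2.429 of the worked example arises only from the rounding of the intermediate figures in the hand computation.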
An Alternative Method of Calculating Correlation from the
Scattergram.—We will describe one of the various alternative methods,
which may be found simpler to use than the above, direct, method. It
eliminates the collection of the contributions to $\Sigma F x_1 x_2$
from each cell, and substitutes the calculation of a third S.D., which
we shall call $\sigma_3$. The formula for r is then:

$$r = \frac{\sigma_1^2 + \sigma_2^2 - \sigma_3^2}{2 \sigma_1 \sigma_2}$$

Here $\sigma_1$ and $\sigma_2$ are the S.D.s of the $F_1$ and $F_2$
distributions, as before. The distribution whose S.D. is $\sigma_3$ is
obtained by counting the total frequencies, not in the rows or
columns, but along the diagonals. We start from the top left-hand
corner and proceed to the bottom right-hand, counting all the entries
along the diagonals at right angles to this line as we go. On the
diagonal running from $x_1 = +1$, $x_2 = -4$ to $x_1 = +3$,
$x_2 = -2$, there is one entry. The next diagonal runs from $x_1 = 0$,
$x_2 = -4$ to $x_1 = +3$, $x_2 = -1$; this takes in two entries. The
next diagonal includes no entries, the next 4, and so on. The complete
distribution, which we shall call $F_3$, is given in the following
table. We choose an arbitrary mean, and work out its S.D. in the usual
manner.


  F3     d     F3d    F3d²
   1    +5     +5      25
   2    +4     +8      32
   0    +3      0       0
   4    +2     +8      16
   6    +1     +6       6
  10     0    +27       0
   8    -1     -8       8
   2    -2     -4       8
   1    -3     -3       9
   1    -4     -4      16
   0    -5      0       0
   1    -6     -6      36
  36          -25     156

$$\frac{\Sigma F_3 d}{N} = \frac{27 - 25}{36} = 0.0555 \qquad \frac{\Sigma F_3 d^2}{N} = \frac{156}{36} = 4.3333$$

$$\sigma_3^2 = 4.3333 - (0.0555)^2 = 4.330$$

We already know $\sigma_1^2$ and $\sigma_2^2$ to be 3.990 and 5.898.
Hence:

$$r = \frac{3.990 + 5.898 - 4.330}{2 \times 1.997 \times 2.429} = +0.573$$

This result is identical with that obtained above.
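The diagonal method lends itself to a short check. In the Python sketch below (an illustration under stated assumptions, not the author's own working) each of the three distributions is entered as a mapping from class deviation to frequency, and the helper name `variance` is introduced only for this example. The diagonal deviation d is taken as x1 minus x2, which is why the variance of the diagonal distribution equals the sum of the two variances less twice the product term.

```python
from math import sqrt


def variance(freq):
    """Variance of a grouped distribution {deviation: frequency} about its true mean."""
    n = sum(freq.values())
    mean = sum(d * f for d, f in freq.items()) / n
    return sum(d * d * f for d, f in freq.items()) / n - mean ** 2


# Row totals, column totals, and diagonal totals of the scattergram.
F1 = {3: 3, 2: 10, 1: 5, 0: 5, -1: 6, -2: 2, -3: 3, -4: 2}
F2 = {5: 1, 4: 2, 3: 4, 2: 4, 1: 7, 0: 3, -1: 4, -2: 4, -3: 5, -4: 2}
F3 = {5: 1, 4: 2, 3: 0, 2: 4, 1: 6, 0: 10,
      -1: 8, -2: 2, -3: 1, -4: 1, -5: 0, -6: 1}

v1, v2, v3 = variance(F1), variance(F2), variance(F3)
r = (v1 + v2 - v3) / (2 * sqrt(v1) * sqrt(v2))

print(round(v1, 3), round(v2, 3), round(v3, 3))  # 3.99 5.897 4.33
print(round(r, 3))                               # 0.573
```

Since only three one-way frequency distributions are needed, no cell-by-cell products have to be collected, which is the saving the method is meant to offer.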

Rank-order Correlation.—A different method for determining
correlations is available for data arranged in order, for example the
orders of merit of a school class on two examinations. When the number
of cases is greater than about 30, the figures involved become
troublesomely big, but with a small number the method is much easier
to use than the product-moment method. Often, therefore, sets of
scores are turned into rank orders and the correlation between these
orders is computed. The formula is as follows:

$$r = 1 - \frac{6 \Sigma d^2}{N(N^2 - 1)}$$

where d is the difference between each person's two ranks. Such a
correlation is often indicated by the Greek letter ρ (rho), instead of
by r.


Ex. 38.—The second and third columns of the following table give the
marks of eleven pupils on tests X1 and X2. The next two columns show
the ranks, pupil A being 6th on X1, 4th on X2, and so on. The d column
gives the differences in rank, regardless of sign. These are squared
and summed.

 Pupil    X1    X2    Rank on X1   Rank on X2     d       d²
   A     [?]    78        6            4          2       4
   B      85    66        1            7          6      36
   C      64    58        7            9          2       4
   D      78    82        3            1          2       4
   E      60    58        8            9          1       1
   F      70    80        5            2½         2½      6.25
   G      83    80        2            2½         ½       0.25
   H      55    71       10            5          5      25
   I      50    58       11            9          2       4
   J      72    55        4           11          7      49
   K      59    68        9            6          3       9
                                                        142.5

Substituting in the formula:

$$\rho = 1 - \frac{6 \times 142.5}{11(121 - 1)} = 1 - 0.648 = +0.352$$

Clearly the bigger the differences in rank positions, the bigger will
$\Sigma d^2$ be, and the lower the value of r. If $6 \Sigma d^2$ is
greater than $N(N^2 - 1)$ the correlation will be negative.

Now there is no difficulty in turning scores into ranks when, as in X1
in the above example, every person gets a different score. But when
two (or more) persons tie with the same score, they then divide the
two (or more) ranks between them, as in X2. Thus pupils F and G must
share 2nd and 3rd place, and are both called 2½, whilst the next
highest pupil, A, is called 4. Again, C, E, and I share the 8th, 9th,
and 10th positions, and are all called 9, whilst J, the next highest,
is 11. Such sharing is undesirable in that it distorts, and generally
decreases, the size of the correlation. It follows, therefore, that
the rank-order method should not be applied when there are a great
many ties.
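The handling of tied ranks and the substitution in the formula may be sketched in Python as follows. This is a hedged illustration: the function names `average_ranks` and `rank_order_rho` are assumptions, the X2 marks are those of the eleven pupils A to K (E's mark being taken as 58, since E is stated to tie with C and I), and the ranks on X1 are entered directly as given in the example.

```python
def average_ranks(scores):
    """Rank from the highest score (rank 1) downwards; ties share the mean of their ranks."""
    order = sorted(scores, reverse=True)
    ranks = {}
    for s in set(scores):
        first = order.index(s) + 1          # first position occupied by this score
        count = order.count(s)              # number of persons tied on it
        ranks[s] = first + (count - 1) / 2  # mean of the positions they share
    return [ranks[s] for s in scores]


def rank_order_rho(ranks_a, ranks_b):
    """Rank-order coefficient: 1 - 6*sum(d^2) / (N*(N^2 - 1))."""
    n = len(ranks_a)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * sum_d2 / (n * (n * n - 1))


# Pupils A to K.
ranks_x1 = [6, 1, 7, 3, 8, 5, 2, 10, 11, 4, 9]
marks_x2 = [78, 66, 58, 82, 58, 80, 80, 71, 58, 55, 68]
ranks_x2 = average_ranks(marks_x2)

print(ranks_x2)  # [4.0, 7.0, 9.0, 1.0, 9.0, 2.5, 2.5, 5.0, 9.0, 11.0, 6.0]
print(round(rank_order_rho(ranks_x1, ranks_x2), 3))  # 0.352
```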


OTHER METHODS OF MEASURING CORRELATION


The nature and the main uses of other methods will be indicated
briefly, but they will not be described in detail. Full accounts of
them may be found in more advanced textbooks.

Spearman's Foot-rule Method.—This, like the rank method, is applicable
to data arranged in rank order. It is perhaps even simpler than the
ordinary rank method, but the coefficients which it yields range only
from +1.0 to −0.5, not from +1.0 to −1.0, and they often diverge
rather widely from those obtained by the first two methods. A new and
easy method, which possesses certain important statistical advantages
over the rank method, has been described recently by Kendall, but has
not yet been very widely used (cf. Kendall and Babington Smith, 1938).
Correlation Ratio.—This is denoted by the symbol η (eta), instead of
by r. We saw in Fig. 21 (p. 106), that the ticks tend to be grouped
along a straight line. This is referred to as 'linear regression.'
When the regression is non-linear, that is, when the ticks tend to
cluster round a curve, a product-moment correlation fails to show the
closeness of relationship, and the correlation ratio is used instead.
For example, when tests of motor perseveration are compared with
assessments of character traits, it is found that both the persons
with high perseveration scores and those with low scores are often
weak in character, whereas those with medium scores get the best
character assessments. This type of relationship is illustrated in the
accompanying scattergram, Fig. 22.

[Fig. 22.—Scattergram of non-linear correlation. Axes: character
ratings; motor perseveration scores.]
Coefficients of Association and Colligation.—There are many variables
which we may wish to compare, but to which we cannot apply any of the
correlation techniques because they are not amenable to precise
measurement. One instance is the colour of people's eyes. We can
assign people to one of a number of more or less distinct groups or
categories according to their eye-colour, even if we cannot measure
the colour. The same can be done with hair-colour, and the two colours
may then be compared. An appropriate technique will tell us the extent
to which brown eyes and dark hair, blue or grey eyes and light hair,
are associated or go together with one another. A further instance of
a so-called 'categoric' variable is sex. We might find that more men
than women in a university take scientific courses of study, and that
more women than men take literary courses. The coefficients of
association or colligation would then provide a measure of the
correspondence between sex and scientific versus literary interests.

Ex. 39.—Take the data from Ex. 36, and suppose that we do not know the
actual marks, but only have the categories 'top half' and 'bottom
half' on both examinations.¹ The following table may be drawn up:

                                   GEOGRAPHY
                          Bottom Half      Top Half
                          10 or under      11 or over
 HISTORY   Top Half
           15 or over          5               13
           Bottom Half
           14 or under        13                5

Yule's coefficients of association (Q), and colligation (ω), are given
by the formulae:

$$Q = \frac{ad - bc}{ad + bc} \qquad \omega = \frac{\sqrt{ad} - \sqrt{bc}}{\sqrt{ad} + \sqrt{bc}}$$

where a, b, c, and d are the four frequencies in the table, a and d
being those of the two cells in which the halves agree (here 13 and
13), b and c those of the cells in which they disagree (5 and 5).

Substituting, we find Q = +0.742, ω = +0.444. If all the pupils in the
top half on one examination had been in the top half on the other, the
table would have been

      0   18
     18    0

and both coefficients would have been 1.0. If there was no
relationship, and the table had been

      9    9
      9    9

both coefficients would have been zero. But with intermediate
distributions the coefficients are clearly not comparable either with
one another, or with the correlation coefficient, r.

¹ There is no necessity for the proportions to be equal, as they are
in the present instance.
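The two coefficients are simple enough to verify by machine. The Python sketch below is illustrative only: the function names are assumptions, and the four frequencies are those obtained by tallying the marks of Ex. 36 into top and bottom halves, with `a` and `d` standing for the cells in which the two halves agree.

```python
from math import sqrt


def yule_q(a, b, c, d):
    """Yule's coefficient of association, Q = (ad - bc) / (ad + bc)."""
    return (a * d - b * c) / (a * d + b * c)


def yule_omega(a, b, c, d):
    """Yule's coefficient of colligation, (sqrt(ad) - sqrt(bc)) / (sqrt(ad) + sqrt(bc))."""
    return (sqrt(a * d) - sqrt(b * c)) / (sqrt(a * d) + sqrt(b * c))


# a, d: top half on both, bottom half on both; b, c: the two mixed cells.
a, b, c, d = 13, 5, 5, 13

print(round(yule_q(a, b, c, d), 3))      # 0.742
print(round(yule_omega(a, b, c, d), 3))  # 0.444
print(yule_q(18, 0, 0, 18), yule_omega(18, 0, 0, 18))  # 1.0 1.0
print(yule_q(9, 9, 9, 9), yule_omega(9, 9, 9, 9))      # 0.0 0.0
```

The two limiting tables show why the coefficients agree only at the extremes: for intermediate tables Q is always the larger in absolute size.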

What is known as tetrachoric r may sometimes be applied to these
categoric classifications,¹ and this is supposed to be comparable to a
product-moment coefficient. But it is far more difficult to compute
than the above two coefficients. In the present instance it works out
at +0.642. The discrepancy between this figure and our r of +0.579 is
probably due to the small number of cases and the irregularity of
their distribution.

¹ Tetrachoric r can only be used when it can be assumed that the
variables to be compared, although classified into two classes, are
really normally distributed. This is true of examination marks, but
not of sex or eye-colour.

Biserial r.—When one of the variables to be compared has been
accurately measured and has yielded a normal distribution, but the
other (though believed to be normally distributed) can only be
expressed in the form of two categories, biserial r may be used. For
instance, if we wish to compare intelligence-test scores with success
or failure in a scholarship examination, we may not know the
examination marks, but we do know the intelligence scores of scholars
and non-scholars.
Ex. 40.—Suppose that we have all the history marks of Ex. 36, and that
we know who fall into the top half, who into the bottom half, for
geography. The average history mark of the top geographers is 16.4,
that of the bottom ones is 11.9. From the difference in averages, 4.5,
from the S.D. of the history marks, and from the proportions of pupils
in the two groups,¹ we can find biserial r between history and
geography marks to be +0.722.

Biserial r often fails to agree closely with product-moment r. In this
instance the small size of N, and the irregularity of the
distributions, accentuate the discrepancy.

¹ There is no necessity for the proportions to be equal, as they are
in the present instance.
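The text does not print the formula for biserial r. The Python sketch below assumes the usual textbook form, in which the difference between the two group means is divided by the S.D. of the whole distribution and multiplied by pq/y, where p and q are the proportions in the two groups and y is the ordinate of the unit normal curve at the point of division. The function name `biserial_r` and its parameters are illustrative, and the S.D. of 3.903 is the value of the history S.D. computed from the ungrouped marks.

```python
from statistics import NormalDist


def biserial_r(mean_upper, mean_lower, sd_all, p_upper):
    """Biserial r, assuming the form (M_upper - M_lower) / sd * (p * q / y)."""
    q = 1.0 - p_upper
    unit_normal = NormalDist()            # mean 0, S.D. 1
    z = unit_normal.inv_cdf(q)            # cutting point with proportion p_upper above it
    y = unit_normal.pdf(z)                # ordinate of the normal curve at that point
    return (mean_upper - mean_lower) / sd_all * (p_upper * q) / y


# Mean history marks of the top and bottom geographers, S.D. of all history marks,
# and the proportion of pupils in the top group (one half).
r_bis = biserial_r(16.4, 11.9, 3.903, 0.5)
print(round(r_bis, 2))  # 0.72
```

Under these assumptions the result agrees with the +0.722 quoted above to two places of decimals; the third place depends on how the intermediate figures are rounded.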

Contingency.—When the data are grouped into two or more classes, but
cannot be regarded as normally distributed, the measurement of
correspondence is carried out by quite a different method, based on
testing for goodness of fit, as described in Chap. V. For instance, we
might wish to compare political and religious affiliations. We might
classify a large group of persons into conservatives, liberals,
socialists, communists, and re-classify into Roman Catholics,
Anglicans, Free Church Protestants, Jews. We should then have a 4 × 4
table, to which the following method would apply.

We start by assuming that there is no relationship between the
variables, and compute on the basis of this hypothesis how many
persons would be expected to fall under each of our 16 categories. The
actual frequencies are then compared with these $F_e$'s. If they
diverge from one another to an extent which cannot possibly be
ascribed to chance errors of sampling, in other words, if the actual
F's entirely fail to fit in with our initial hypothesis, then we have
evidence suggesting that there is a relationship between the
variables.

Ex. 41.—Take the data of Ex. 36, and suppose that the history marks
consist only of A, B, C grades, the geography marks of A, B, C, D
grades. The following 3 × 4 table is obtained.

 TABLE OF F's.
                        GEOGRAPHY
                  D      C      B      A     Total
 HISTORY   A      0      1      7      5      13
           B      4      7      4      1      16
           C      3      3      0      1       7
         Total    7     11     11      7      36

Now if there was no relation between the two sets of marks, then we
might expect the 13 pupils with A's in history to be distributed, as
regards their geography grades, in the proportions 7 : 11 : 11 : 7.
That is, the $F_e$ in cell AD is $\frac{13 \times 7}{36}$, the $F_e$
in cell AC is $\frac{13 \times 11}{36}$, and so on. Similarly the
$F_e$'s for the pupils with B in history are $\frac{16 \times 7}{36}$
in cell BD, $\frac{16 \times 11}{36}$ in cell BC, and so on. We arrive
then at the following table of $F_e$'s.

 TABLE OF Fe's.
                        GEOGRAPHY
                  D      C      B      A
 HISTORY   A    2.53   3.97   3.97   2.53
           B    3.11   4.89   4.89   3.11
           C    1.36   2.14   2.14   1.36

The procedure now consists, as before, in computing
$\frac{(F - F_e)^2}{F_e}$ and the sum of these quantities, chi
squared.

[TABLE OF (F − Fe).]

$$\Sigma \frac{(F - F_e)^2}{F_e} = \chi^2 = 16.81$$

From tables of $\chi^2$ it is found that the probability of this value
of $\chi^2$ is 0.01, which means that there is only one chance in a
hundred that our obtained F's could deviate as they do from the
expected $F_e$'s through chance errors of sampling.

Chi squared may now be expressed as a coefficient, C, which is called
the coefficient of mean square contingency.

$$C = \sqrt{\frac{\chi^2}{N + \chi^2}}$$

Substituting, with N = 36, C = 0.563.
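The steps of the contingency method can be followed in a short Python sketch. It is offered under stated assumptions rather than as the author's own working: the observed table is tallied from the marks of Ex. 36, with grade limits inferred from the marginal totals quoted in the text (13 A's in history; geography grades in the proportions 7 : 11 : 11 : 7), and the variable names are illustrative.

```python
from math import sqrt

# Observed frequencies F: rows are history grades A, B, C;
# columns are geography grades D, C, B, A.
observed = [
    [0, 1, 7, 5],
    [4, 7, 4, 1],
    [3, 3, 0, 1],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
N = sum(row_totals)

# Expected frequency in each cell on the hypothesis of no relationship.
expected = [[r * c / N for c in col_totals] for r in row_totals]

chi_squared = sum(
    (f - fe) ** 2 / fe
    for f_row, fe_row in zip(observed, expected)
    for f, fe in zip(f_row, fe_row)
)
C = sqrt(chi_squared / (N + chi_squared))

print(row_totals, col_totals, N)  # [13, 16, 7] [7, 11, 11, 7] 36
print(round(chi_squared, 2))      # 16.79
print(round(C, 2))                # 0.56
```

Carried to full accuracy the sum comes to about 16.8, in line with the value quoted in the text; small differences in the second decimal place are to be expected when the expected frequencies are rounded before squaring.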


In this example C happens to be quite close to the r which we obtained
previously, but they are not usually closely comparable to one
another, for two reasons. First, the maximum possible value of C is
always less than 1.0, by an amount depending on the fewness of the
number of cells. In this instance, with only 12 cells, C cannot exceed
0.775. Hence it is best, whenever possible, to adopt a finer
classification, e.g. 5 × 5 categories, which will yield a maximum C of
0.894.

Secondly, we have seen that C is, strictly, a measure of divergence,
and such divergence does not necessarily prove a relationship. Suppose
for example that our original table of F's had been:

[TABLE OF F's.]

There is now no appreciable correspondence between high scores in
history and high scores in geography. Yet the contingency coefficient
would be precisely the same as before. Contingency should then be used
as a measure of relationship only with great caution, if at all.
The Matching Method.—One context in which C is useful is in what are
known as matching experiments. Suppose we wish to find whether style
of English composition reflects the character of the writer. We would
get, say, ten persons each to write a composition on a given theme,
and then obtain from acquaintances as full and accurate descriptions
as possible of the character of each writer. The character
descriptions and compositions (which should be anonymous) would then
be compared by a group of individuals who act as judges or matchers.
Each judge would try to match or identify each composition with the
appropriate character description. Some judges might ascribe A's
composition to A, others to B or C, and so on. We would therefore
arrive at a 10 × 10 table, similar to the contingency tables above,
giving the frequency with which each composition was assigned to each
character. By a modification of the mean square contingency technique,
the accuracy of matchings, i.e. the extent to which the compositions
appear to the judges to be related to the writers' characters, may be
expressed as a coefficient, C. The present writer has described this
technique and its applications elsewhere (Vernon, 1936).
Analysis of Variance.—There is an unfortunate tendency in contemporary
educational and psychological investigations to make over-much use of
correlation techniques, when other methods of statistical treatment of
the results might be more appropriate. One very powerful tool which
has so far been but little applied to educational statistics is known
as the analysis of variance. This involves rather more advanced theory
than we have space to discuss,¹ but the following is an instance of
its application.

¹ A good elementary description is given by Thouless (1937). See also
Fisher (1930).

Ex. 42.—The writer's men students were divided, according to their
lines of training, into seven sections, each numbering about 22
students. The average level of ability in different sections seemed to
differ considerably, and their average percentage marks on
examinations in psychology are listed in the second column of the
accompanying table. The third column shows, however, that the marks in
any one section were very variable, usually ranging from about 45% to
85%.

 Section    Mean Psychology Mark    S.D. of Marks
    A             72.89                  8.25
    B             68.69                  8.48
    C             65.59                  7.88
    D             64.41                  7.05
    E             68.95                  8.58
    F             68.52                 10.71
    G             68.48                 10.71

Thus in order to answer the question whether there exists any relation
between a student's ability and the section to which he is assigned,
we must determine whether these differences between the means of the
sections are significant or whether such differences might arise
merely through chance fluctuations of sampling. "Variance" is simply
another term for spread, dispersion, or variability (actually the
variance of a distribution of measures is the square of its S.D.). The
technique of analysis of variance enables us to say how far these
differences between sections can be ascribed to the effect of
individual differences within sections. In this instance the
probability of such differences between means arising by chance is
quite small, approximately 0.01, so showing that there is a definite
tendency for better students to be put into some sections, poorer
students into others.¹

1 In Fisher's terminology z for the differences in variance is 0.610.
Degrees of freedom are n1 = 6, n2 = 145. A value of z of 0.54 would
under these conditions have a probability of 0.01.


THE PROBABLE ERRORS OF CORRELATION COEFFICIENTS
A correlation coefficient is, like any other numerical
psychological result, liable to vary on account of sampling
errors, especially when the sample of persons, whose
scores are inter-correlated, is of small size. The formula
for the P.E. of a product-moment coefficient is:

$$\mathrm{P.E.}_r = \frac{0.6745\,(1 - r^2)}{\sqrt{N}}$$

Applying this to our r of + 0.579, where N = 36, we get
P.E. = ± 0.075. It is usual to write the P.E. after the
coefficient, thus: r = + 0.579 ± 0.075. The P.E. of a
rank order coefficient is slightly greater than that of a
product-moment one, and is usually calculated by the
formula:

$$\mathrm{P.E.}_\rho = \frac{0.7063\,(1 - \rho^2)}{\sqrt{N}}$$


For the determination of the P.E.s or S.E.s of the other
coefficients mentioned above, the reader should consult
more advanced textbooks, such as Kelley (1924) or Guilford
(1936).

The connotation of the P.E. of r or ρ is similar to that of
an average or difference (cf. Chap. V). Thus + 0.579
± 0.075 signifies that if the same history and geography
papers were set to much larger numbers of similar children,
and the marks inter-correlated, the true value of r so
obtained would be as likely as not to lie between + 0.654
and + 0.504 (i.e. r + 1 × P.E. and r − 1 × P.E.) but
that there is a 1 in 370 probability that it might deviate
as widely as + 0.916 or + 0.242 (4½ × P.E.).
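
As a check on the arithmetic, the two probable-error formulae can be evaluated directly. What follows is only a minimal sketch in Python; the function names are illustrative and do not come from the text.

```python
import math


def probable_error_r(r: float, n: int) -> float:
    """P.E. of a product-moment coefficient: 0.6745 (1 - r^2) / sqrt(N)."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


def probable_error_rho(rho: float, n: int) -> float:
    """P.E. of a rank-order coefficient: 0.7063 (1 - rho^2) / sqrt(N)."""
    return 0.7063 * (1.0 - rho * rho) / math.sqrt(n)


# History and geography example: r = +0.579 from N = 36 pupils.
r, n = 0.579, 36
pe = probable_error_r(r, n)
print(round(pe, 3))                        # 0.075
print(round(r - pe, 3), round(r + pe, 3))  # 0.504 0.654
```

The wider limits quoted in the text (± 4½ × P.E.) follow in the same way by multiplying `pe` by 4.5.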

Just as a difference between averages should be three or
more times its S.E. to be accepted as statistically
significant, so a correlation should be 4½ times its P.E. before
it is regarded as showing a real relationship between the
inter-correlated variables. A coefficient of, say, + 0.30
± 0.10 is on the borderline of significance. For if the true
value of r was zero, the scores of a limited sample of persons
might yield + 0.30 once in 44 times.

In our previous illustration + 0.579 is much greater than
4½ × P.E. Yet, as we have seen, this value is far from
trustworthy, the reason being the small size of N. Most
investigators consider a correlation calculated from less
than 25 cases to be almost worthless because of its low
reliability, and they prefer to have at least a hundred cases
before they base any important conclusions on the results
of an experiment involving correlations.

Fisher's z Method.—The P.E. method of indicating the
reliability of r has been shown to be somewhat inaccurate,
especially when N is small or r is large. Fisher (1930)
presents a better method which is being widely adopted.
It consists in transforming r into a new quantity called z,
whose true P.E. is easily calculated. Applying this method
to our r = + 0.579, we find that the limits within which
true r is as likely as not to lie are almost identical with the
limits derived, above, from P.E.r, but that the extreme
limits (4½ × P.E.) should be + 0.831 and + 0.132 instead
of + 0.916 and + 0.242.
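
The z calculation can be sketched in a few lines. This is an illustration only: it assumes the usual definitions z = ½[log(1 + r) − log(1 − r)] and S.E. of z = 1/√(N − 3), takes the P.E. of z as 0.6745 times that S.E., and uses a function name of our own choosing.

```python
import math


def z_limits(r, n, k):
    """Limits for the true r, k probable errors either side, worked on the z scale."""
    z = math.atanh(r)                    # z = 1/2 [log(1 + r) - log(1 - r)]
    pe_z = 0.6745 / math.sqrt(n - 3)     # P.E. of z, from S.E. = 1 / sqrt(N - 3)
    # Step out k probable errors in z, then translate back to r.
    return math.tanh(z - k * pe_z), math.tanh(z + k * pe_z)


low, high = z_limits(0.579, 36, 4.5)
print(round(low, 2), round(high, 2))     # 0.13 0.83
```

Unlike the limits from P.E.r, these are not symmetrical about r, and the upper limit cannot exceed 1.0.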

Combining and Comparing Correlations.—The z method
also provides a useful way of combining two or more
correlations.


Ex. 43.—Suppose that another class of 43 pupils had taken
the history and geography papers and yielded a coefficient of
+ 0.450 ± 0.082, what is the best estimate of the true
correlation? The procedure is to determine z1 and z2 for the
two r's, and then to compute the weighted average,

$$\frac{z_1(N_1 - 3) + z_2(N_2 - 3)}{N_1 + N_2 - 6}$$

Translating back from the average z gives us our answer, r =
+ 0.512 ± 0.057.
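
The weighted average can be sketched as follows. The function name is hypothetical, and the second class is taken as r = + 0.450 with N = 43, which is an assumed reading of the example's figures; working without rounded tables gives a result that may differ from the printed answer in the third decimal place.

```python
import math


def combine_correlations(pairs):
    """Pool several (r, N) pairs by averaging their z values, weighted by N - 3."""
    pairs = list(pairs)
    weighted_sum = sum(math.atanh(r) * (n - 3) for r, n in pairs)
    total_weight = sum(n - 3 for _, n in pairs)
    # Translate the average z back into a correlation coefficient.
    return math.tanh(weighted_sum / total_weight)


pooled = combine_correlations([(0.579, 36), (0.450, 43)])
print(round(pooled, 2))   # 0.51
```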


It is often required to find whether a correlation in one
group is appreciably different from that in another group,
or whether the difference can be ascribed to errors of
sampling. When the numbers of cases are large the stock
method may be used:

$$\sigma_d = \sqrt{\mathrm{S.E.}_{r_1}^2 + \mathrm{S.E.}_{r_2}^2}$$


Ex. 44.—What is the significance of the difference between
+ 0.579 ± 0.075 and + 0.450 ± 0.082? The S.E.s of these
coefficients are their P.E.s divided by 0.6745, i.e. 0.1108 and
0.1216. σd = 0.1645. The difference is 0.579 − 0.450 = 0.129,
which is not as large as its own σd. A difference as large as
or larger than this would occur 43% of times as a mere matter
of chance.


As the numbers are small, Fisher's z method should
preferably be used. This consists in computing z1, z2 and their
S.E.s, and applying the stock formula to these S.E.s. The
difference in z's is + 0.1763 in the above example. S.E.z1 =
0.1741, S.E.z2 = 0.1581. σd = √(0.1741² + 0.1581²) =
0.2352. Here the difference is 0.75 times its σd, and it
might occur 45% of times as a mere matter of chance.

1 z = ½ log_e (1 + r) − ½ log_e (1 − r). The S.E. of z = 1/√(N − 3).
Fisher provides tables of z for various values of r.
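
The comparison by the z method may be sketched as below. The function name is illustrative, the second coefficient is taken as + 0.450 from 43 cases, and the chance probability is obtained from the normal curve (two-tailed), which is an assumption about how the quoted percentage was reached.

```python
import math


def compare_correlations(r1, n1, r2, n2):
    """Difference between two correlations on the z scale, with its S.E. and chance probability."""
    diff = math.atanh(r1) - math.atanh(r2)
    se_diff = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    ratio = diff / se_diff
    # Two-tailed probability of a ratio at least this large under the normal curve.
    p = math.erfc(abs(ratio) / math.sqrt(2.0))
    return diff, se_diff, ratio, p


diff, se_diff, ratio, p = compare_correlations(0.579, 36, 0.450, 43)
print(round(diff, 4), round(se_diff, 4))   # 0.1763 0.2352
print(round(ratio, 2), round(p, 2))        # 0.75 0.45
```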


CHAPTER VII 


INTERPRETATION AND APPLICATIONS OF
CORRELATIONS


A CORRELATION coefficient is never an end in itself, but
always a means towards the proper interpretation of
numerical data. By studying the correlations between
tests of various abilities it is possible to analyse
scientifically the nature of such abilities. This is one important
application, which we shall consider in Chap. VIII. But
the primary value of a correlation is that it enables us to
make predictions. If abilities X1 and X2 are known to be
inter-correlated, then we can use measures of X1 to predict
scores on X2, and vice versa. For instance a pupil's
probable scholastic success may be forecast from his
performance on an intelligence test. This cannot, of course,
be done with complete accuracy, but the correlation
coefficient will also tell us how accurate is the forecast, or
what are its limits of error.


Ex. 45.—The following scattergram shows the Intelligence
Quotients of 440 children and adults on the Stanford-Binet
Scale, and their I.Q.s on the Vocabulary test from this scale.
(The group or category 140 includes all testees with I.Q.s
around 140, i.e. ranging from 135½ to 145½; similarly with the
other groups.) Here the correlation is quite a high one, namely
+ 0.799 ± 0.012. Hence we would be justified in testing people
only with the Vocabulary test, and predicting from their results
what would be their Binet I.Q.s. Let us see how to make such
predictions, and how good they would be.

[Scattergram: X1 = Binet I.Q., X2 = Vocabulary I.Q.]


Suppose a child obtains a Vocabulary I.Q. of 120. His
Binet I.Q. may, according to the scattergram, lie as high
as the 140 class, i.e. up to 145, or as low as the 100 class,
i.e. down to 96. But presumably its most probable value
will be the average of the column:


X1     F
140    3
130    12
120    16
110    9
100    8


This works out at 118.5. Similarly for a child whose
Vocabulary I.Q. is 70, the most likely value of his Binet
I.Q. is 75.0, though it may lie anywhere between 56 and 95.
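
The column average is simply a mean weighted by the frequencies. A minimal sketch, using the frequencies listed for the Vocabulary I.Q. 120 column (the function name is ours):

```python
def column_mean(frequencies):
    """Mean of class mid-points weighted by the number of cases in each class."""
    total_cases = sum(frequencies.values())
    return sum(x * f for x, f in frequencies.items()) / total_cases


# Binet I.Q. class -> number of testees, for Vocabulary I.Q. = 120.
column_120 = {140: 3, 130: 12, 120: 16, 110: 9, 100: 8}
print(round(column_mean(column_120), 1))   # 118.5
```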

Along the bottom of the scattergram are given the
averages of the columns, i.e. of the X1 measures which most
probably correspond to the X2 measures at the heads of the
columns. Fig. 23 is a graph of these figures. It will be
seen that the points on the graph lie approximately on a
straight line which runs through X1 = 100, X2 = 100.
That they do not all lie exactly on the line is due merely
to irregularities in the data. For though the numbers
involved in our example are large, they are still not large
enough to yield really smooth normal distributions, or
regularly spaced averages.

Now as we approach the extreme values of X2, the
corresponding values of X1 tend to lag behind; 118.5 is smaller
than 120, 138 is much smaller than 150. Similarly with
values below 100, 75 is not so low as 70, 69 is much less low
than 60. This is an important and characteristic feature
of all correlations. It is known as the regression of X1
towards the mean. The line drawn through the points on
the graph is called a regression line, or a graph showing the
regression of X1 on X2.


Fig. 23.—Graph showing the most probable values of X1 for testees who
obtain each value of X2.


Suppose now that the correlation is a perfect one. The
scores on X1 and X2 would be identical, and the slope of
the regression line would be 45°. Next suppose that there
is no correlation at all between X1 and X2. It would then
obviously be impossible to predict scores on one variable
from those on the other. If we took successive columns of
the scattergram and averaged them, we should find that
all these means lay close to the mean of X1. Whatever the
value of the Vocabulary I.Q., the average Binet I.Q. would
be round about 100. In other words, the regression line
would be horizontal. From these two hypothetical cases
we can now see that the nearer the slope of the regression
line is to 45°, the higher the correlation; the nearer it is
to horizontal, the lower the correlation. Furthermore,
when the correlation is perfect, there is no regression
towards the mean, since each X1 value is as big as each
X2 value. But when the correlation is zero, the regression
is complete, since whatever the value of X2 the value of
X1 is always the mean. Thus the greater the regression,
the nearer the correlation approaches zero.

It has been assumed in the previous paragraph that the
X1 and X2 measures are expressed in equivalent units. If
the dispersions of the two distributions are very different,
then the slope of the regression line corresponding to perfect
correlation would not, of course, be 45°. However, as
shown in Chap. IV, measures can be turned into equivalent
units if they are expressed as deviations from their means,
and divided by their standard deviations, i.e. as x1/σ1 and
x2/σ2. If this is done the slope of the regression line is
identical with r:

$$\frac{x_1 \sigma_2}{x_2 \sigma_1} = r$$

When the slope is 45°, this ratio is 1.0, that is perfect
correlation. When the slope is horizontal, the ratio is 0.0,
that is zero correlation.

The slope of the regression line plotted in Fig. 23 is 0.80.
σ1 and σ2 are 17.49 and 17.84 (i.e. the Vocabulary I.Q.s are
slightly more spread out than the Binet I.Q.s). Hence

$$r = \frac{0.80 \times 17.84}{17.49} = 0.815$$

This is very close to the figure already quoted for the
correlation, as determined by the product-moment method. As
a matter of historical fact, r was originally determined by
plotting the regression line and measuring its slope, before
statisticians invented the product-moment method. The latter
method is always employed now because, as we have seen, the
regression line obtained by averaging the columns is somewhat
inexact, even when N is large. It is more useful to reverse the
historical procedure, to calculate σ1, σ2, and r, and from them
to determine the regression line. If x1σ2/x2σ1 = r, then

$$x_1 = r\,\frac{\sigma_1}{\sigma_2}\,x_2$$

This is the algebraic equation of the regression
line, and it can be used for predicting any individual's most
probable x1 score from his x2 score. Alternatively, if we
wish to predict in terms of original measures, instead of
measures expressed as deviations from their means, the
formula becomes:

$$(X_1 - M_1) = r\,\frac{\sigma_1}{\sigma_2}\,(X_2 - M_2)$$

Ex. 46.—What is the most probable Binet I.Q. corresponding
to a Vocabulary I.Q. of 140? According to the averages listed
at the foot of the scattergram, the answer is 128.3. But this
is based on only 6 cases and is therefore very unreliable.
Applying the formula:

$$X_1 - 100 = 0.799\,(140 - 100) \times \frac{17.49}{17.84}$$

X1 = 131.3 is the correct answer.
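
The regression formula lends itself to a small helper. The sketch below is illustrative only (the function and argument names are not from the text); it predicts one variable from the other given the two means, the two S.D.s, and r, with both means taken as 100 as in the example.

```python
def predict(x_known, mean_known, sd_known, mean_wanted, sd_wanted, r):
    """Most probable score on the wanted variable, from the regression equation."""
    return mean_wanted + r * (sd_wanted / sd_known) * (x_known - mean_known)


# Binet I.Q. (S.D. 17.49) predicted from a Vocabulary I.Q. of 140 (S.D. 17.84).
binet = predict(140, 100, 17.84, 100, 17.49, 0.799)
print(round(binet, 1))   # 131.3
```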

Regression of X2 on X1.—So far we have dealt only with
the prediction of X1 knowing X2. The opposite is equally
possible. On the right-hand side of the scattergram in Ex. 45
are listed the average values of X2 which constitute the best
estimates of the Vocabulary I.Q.s of persons with various Binet
I.Q.s. Here, too, there is regression towards the mean.
The probable Vocabulary I.Q. corresponding to a Binet
I.Q. of 120 is 115.6, and for a Binet I.Q. of 70 it is 74.0.
A second graph may therefore be drawn, showing the
regression of X2 on X1. Fig. 24 shows the two lines
diagrammatically. If the correlation was perfect the two
lines would coincide, both having a slope of 45° (provided,
always, that X1 and X2 are measured in equivalent units).
If the correlation was zero, the second line would be
vertical, as in Fig. 25, since the only possible estimate of
Vocabulary I.Q. from Binet I.Q. would be 100. With an
intermediate correlation, such as our + 0.799, the second line
makes the same slope with the vertical as does the first
regression line with the horizontal; and the two cross at the
means of X1 and X2. Thus the algebraic equation of the second
line, which is also the formula for predicting X2 from X1,
will be:

$$x_2 = r\,\frac{\sigma_2}{\sigma_1}\,x_1$$

or, in terms of scores,

$$(X_2 - M_2) = r\,\frac{\sigma_2}{\sigma_1}\,(X_1 - M_1)$$


Fig. 24.—Regression lines when r = + 0.799.

Fig. 25.—Regression lines when r = 0.0.


Ex. 47.—What is the probable Vocabulary I.Q. of a testee
with Binet I.Q. 80?

$$X_2 = 100 + 0.799\,(80 - 100) \times \frac{17.84}{17.49} = 83.7$$

The estimate based on the scattergram was 82.4,
which agrees quite closely with our answer.
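
The two regression lines can be compared numerically. In this sketch (names are illustrative) the two slopes are computed from r and the S.D.s, the Vocabulary I.Q. for a Binet I.Q. of 80 is estimated with both means taken as 100, and the product of the two slopes is shown to equal r², which is one way of seeing that the lines coincide only when r = 1.

```python
def regression_slopes(r, sd1, sd2):
    """Return (slope for predicting X1 from X2, slope for predicting X2 from X1)."""
    return r * sd1 / sd2, r * sd2 / sd1


b12, b21 = regression_slopes(0.799, 17.49, 17.84)

# Probable Vocabulary I.Q. (X2) for a Binet I.Q. (X1) of 80.
vocabulary = 100 + b21 * (80 - 100)
print(round(vocabulary, 1))                         # 83.7

# The product of the two slopes is r squared.
print(round(b12 * b21, 4), round(0.799 ** 2, 4))    # 0.6384 0.6384
```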


P.E. of Estimated Scores.—The regression equations will
tell us the most probable estimate of X1 from X2 or of X2
from X1. But it is obvious from the scattergram that the
actual X1 may fall within quite a wide range on either side
of the estimated X1, and that similarly an estimated X2 is
only the average of a wide range of possible X2's. The
ranges may be determined from the formulae:

$$\mathrm{P.E.}_{\mathrm{estim.}X_1} = 0.6745\,\sigma_1\sqrt{1 - r^2}$$
$$\mathrm{P.E.}_{\mathrm{estim.}X_2} = 0.6745\,\sigma_2\sqrt{1 - r^2}$$

Applied to Exs. 46 and 47: P.E.estim.X1 = 0.6745 ×
17.49 √(1 − 0.799²) = 7.1, P.E.estim.X2 = 7.3.
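
A short sketch of the same formula follows; the function name is ours. Evaluated without intermediate rounding it gives about 7.1 for the Binet estimate and about 7.2 for the Vocabulary estimate, the latter a shade below the figure quoted in the text.

```python
import math


def pe_of_estimate(sd, r):
    """P.E. of a score estimated by regression: 0.6745 * sd * sqrt(1 - r^2)."""
    return 0.6745 * sd * math.sqrt(1.0 - r * r)


print(round(pe_of_estimate(17.49, 0.799), 1))   # 7.1  (estimated Binet I.Q.)
print(round(pe_of_estimate(17.84, 0.799), 1))   # 7.2  (estimated Vocabulary I.Q.)
print(pe_of_estimate(17.49, 1.0))               # 0.0  (perfect correlation, no error)
```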

The P.E. of an estimated score has a slightly different
meaning from previous P.E.s. It is best shown by the
following example. Consider a very large group of persons
whose Binet I.Q.s are 110, and whose estimated Vocabulary
I.Q.s are 108.1 ± 7.3. Then one half of them should
have Vocabulary I.Q.s lying between 115.4 and 100.8, and
practically all of them should lie within a range ± 4½ ×
P.E., i.e. between 140.9 and 75.3.

Actually we already know the Vocabulary I.Q.s of 71
persons with Binet I.Q.s round 110. This is not a very
large group, but it should serve to test out our prediction.
The scattergram is too coarse, but looking back to the
original scores it appears that 37 of them, or 52%, had
Vocabulary I.Q.s between 115 and 101 (inclusive), and
that the two extreme Vocabulary I.Q.s were 138 and 80.
This agreement is quite close.

It is important to note that the P.E. of an estimated
score decreases as r increases. If the correlation were
perfect there would be no error at all in estimating X1 from
X2 or X2 from X1. The lower the correlation, the less
accurate will be the predictions.


Ex. 48.—As an additional illustration, take the results from
the 36 history and geography papers. Here N is far too small
for accurate regression lines to be obtained by averaging rows
or columns. But by the use of the formulae given above,
probable geography marks can be estimated from history marks,
or vice versa. For instance: what geography mark (X2)
corresponds to a history mark (X1) of 7, and what is its P.E.?

M1 = 14.139     M2 = 9.945
σ1 = 3.903      σ2 = 4.819
r = + 0.579

$$X_2 - 9.945 = 0.579\,(7 - 14.139) \times \frac{4.819}{3.903}$$
$$X_2 = 9.945 - 5.103 = 4.84$$
$$\mathrm{P.E.}_{\mathrm{estim.}X_2} = 0.6745 \times 4.819\sqrt{1 - 0.579^2} = 2.65$$

Therefore X2 = 4.84 ± 2.65.

Here the correlation is only moderate, hence the estimate
is poor in reliability. All we can claim is that a pupil with
a history mark of 7 would be very unlikely to get a
geography mark as high as 17 (X2 + 4½ × P.E.), and would
most probably get round about 2 to 7.
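
The whole of this example can be reproduced in a few lines. The sketch assumes the history mean and S.D. are 14.139 and 3.903 (the reading that agrees with the printed answer of 4.84); the variable names are illustrative.

```python
import math

# History (X1) and geography (X2) marks of the 36 pupils.
mean_history, sd_history = 14.139, 3.903   # assumed reading of the printed figures
mean_geography, sd_geography = 9.945, 4.819
r = 0.579

history_mark = 7
# Regression of geography on history.
estimate = mean_geography + r * (sd_geography / sd_history) * (history_mark - mean_history)
# Probable error of the estimated geography mark.
pe = 0.6745 * sd_geography * math.sqrt(1.0 - r * r)

print(round(estimate, 2), round(pe, 2))     # 4.84 2.65
print(round(estimate + 4.5 * pe, 1))        # 16.8, i.e. about 17
```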


INTERPRETATION OF CORRELATIONS 


In the light of the above discussion it is possible to give
a much more precise meaning to the conception of
correlation than heretofore. A correlation does not mean, as is
sometimes supposed, the percentage agreement between
the correlated variables. In the scattergram on p. 131, for
example, only 37½% of the testees get the same I.Q., or
rather the same class of I.Q., on Binet and Vocabulary,
although the correlation is + 0.799. It is possible to
interpret a coefficient in terms of the number of factors or
elements which are common to the two variables, which
therefore cause them to correlate with one another, but
there is little practical point in so doing. What a
correlation does indicate is the amount of reduction of error in
predicting scores on one variable from scores on the other variable.
When r is zero, the error is at its maximum, namely 0.6745σ.
Predictions based on X2 might fall anywhere within the
whole range of X1 scores, and would be no more accurate
than drawing X1 scores out of a hat. With a correlation
of 1.0 the error is reduced to zero. The extent of reduction
of error for intermediate r's is shown in the following table.

The second column lists values of 100(1 − √(1 − r²)), and
so represents the reduction in percentage terms. These
figures are sometimes referred to as the forecasting efficiencies
of the r's. It will be seen that the forecasting efficiency
only becomes high enough to be of great practical value
when r is in the neighbourhood of 0.8 to 0.9 or more.
Great caution should be exercised therefore before
predicting, say, scholastic success in a secondary school from
elementary school examinations or from intelligence tests,
since these usually give correlations of only + 0.4 to + 0.6
with what we wish to predict. Such deductions are only
8% to 20% more accurate than pure chance guesses.


r        Reduction of error, or
         forecasting efficiency %

0.00       0.0
0.10       0.5
0.20       2.0
0.30       4.6
0.40       8.4
0.50      13.4
0.60      20.0
0.70      28.6
0.80      40.0
0.90      56.4
0.95      68.8
1.00     100.0


1 Cf. Garrett (1937), pp. 348 ff.
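
The table can be regenerated from the formula 100(1 − √(1 − r²)). The following sketch is illustrative only (the function name is ours); values computed without intermediate rounding may differ from the printed ones by a unit in the last place.

```python
import math


def forecasting_efficiency(r):
    """Percentage reduction in the error of prediction for a correlation r."""
    return 100.0 * (1.0 - math.sqrt(1.0 - r * r))


for r in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 1.0):
    print(f"{r:4.2f}  {forecasting_efficiency(r):6.1f}")
```

The steep rise near the bottom of the table shows why correlations of 0.4 to 0.6 improve so little on guessing.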


The Influence upon Correlations of Irrelevant Common 
Factors.—A positive correlation which is statistically
significant (i.e. 3 or preferably 4½ times its P.E.) always indicates
some kind of causal connection, or some common factor or 
factors running through the two variables—even when the 
connection is not strong enough for us to be able to forecast 
one variable from the other. Considerable care is needed, 
however, in interpreting such a connection, for its real 
nature may be quite unsuspected, or the common factors 
may be of an entirely irrelevant kind. Suppose, for 
example, that we correlated the size of boots worn by all 
the children in a school with the speed of their handwriting, 
we should undoubtedly obtain a moderately high coefficient. 
Yet it would be absurd to say that big boots cause quick 
handwriting, or vice versa. The real reason for the
correlation is an irrelevant common factor, in this case age.
The older children on the whole write more quickly and
have larger feet than the younger ones. If we took chil- 
dren all of the same age and compared boots with hand- 
writing, we should be likely to find that the correlation had 
fallen to zero. 

Consider another, rather more complex, illustration.
A positive correlation exists between backwardness in
school work and unemployment among the children's
parents. Idealists may at once jump to the conclusion
that unemployment brings about malnutrition and anxiety
among the children, and that these make their work fall off.
But here also there is an important common element which
goes some way to explain the connection. Unemployed
parents are known to be on the whole less intelligent than
employed ones, and unintelligent parents tend to have
unintelligent children, and low intelligence is a far more
important cause of backwardness than is malnutrition.
We need to take two groups of children, one with parents
employed, one unemployed, groups whose average
intelligence level is identical, and then study their school work
to see whether the former do better than the latter. Only
when we have 'held the intelligence factor constant' in this
example, or 'held the age factor constant' in the previous
example, can we legitimately deduce some causal connection.

Partial Correlation.—The above type of problem or
difficulty in interpretation is constantly occurring in
psychology and education. Sometimes it can be met by applying
what is known as the partial correlation technique. For
this it is necessary to obtain measures of the suspected
irrelevant factor, which we shall call X3, and to work out
the correlations between X1 and X3 (r13), and between X2
and X3 (r23), as well as between X1 and X2 (r12). We
may then 'hold X3 constant' or 'partial it out' by the
following formula, in which r12.3 means the correlation of X1
with X2 when X3 is eliminated:

$$r_{12.3} = \frac{r_{12} - r_{13}\,r_{23}}{\sqrt{1 - r_{13}^2}\,\sqrt{1 - r_{23}^2}}$$

1 It has recently been pointed out that this formula may
legitimately be used only if X3 is measured without error. It is
thus most appropriate for holding age constant, but not for,
say, intelligence; since measurements of the latter always
involve some error. However, Thouless provides an alternative
one, which may be used when the reliability of X3 is known.


Ex. 49.—Suppose the correlation between boots and
handwriting speed to be + 0.45, that between boots and age + 0.70,
and between handwriting and age + 0.60. Then the correlation
between boots and handwriting when the influence of age
is removed is:

$$r = \frac{0.45 - 0.70 \times 0.60}{\sqrt{1 - 0.70^2}\,\sqrt{1 - 0.60^2}} = +\,0.052$$
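
The partial correlation formula is easily put into code. A minimal sketch, with a function name of our own, applied to the boots, handwriting and age figures:

```python
import math


def partial_correlation(r12, r13, r23):
    """Correlation of X1 with X2 when X3 is held constant."""
    return (r12 - r13 * r23) / math.sqrt((1.0 - r13 ** 2) * (1.0 - r23 ** 2))


# X1 = size of boots, X2 = speed of handwriting, X3 = age.
r_boots_writing = partial_correlation(0.45, 0.70, 0.60)
print(round(r_boots_writing, 4))   # 0.0525
```

Nearly all of the original + 0.45 disappears once age is partialled out.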


The Effect of Homogeneity and Heterogeneity upon
Correlations.—We have just seen that when pupils are
heterogeneous as regards age, the correlation between a pair of
their characteristics may be spuriously raised. The same
is true of other types of heterogeneity. Suppose we
calculate the correlation between two tests in an ordinary school
class, and then eliminate from consideration most of the
pupils of medium ability on both tests, and re-calculate the
correlation; the second figure will certainly be larger than
the first. The distributions will not, of course, be normal,
the diversity or heterogeneity of the pupils will be unduly
large and this will increase the size of r. In ordinary
practice the reverse, namely undue homogeneity, is more
likely to occur without the investigator realizing it. Thus
the correlations between intelligence tests and school work
in an ordinary school class are usually lower than they
would be in an unselected group of children of the same age.
For such a school class is more homogeneous as regards
intelligence and school work than is the population as a whole.
The dullest children of that age are likely to be at a special
school, and many of the brightest ones may be at private
schools. Take an extreme case: if we selected a group of
children all of precisely the same intelligence, any
correlation between intelligence and work would disappear. One
important reason why intelligence tests appear to be of
much less value for predicting scholastic aptitude among
secondary grammar than among primary pupils, and to be
poorer still among university students, is that secondary
pupils are more homogeneous or more highly selected than
primary. Most of the children of I.Q. less than 100 are
weeded out before the secondary stage is reached, and
university entrance examinations remove many more from
the lower end of the scale. Thus correlations necessarily
sink as we pass from unselected children to the primary
school, from the primary to the secondary, and from the
secondary to the university level.

It is not easy to decide just what degree of heterogeneity
should be regarded as normal and reasonable, nor to specify
the extent to which any given group of persons exceed or
fall short of this degree of heterogeneity. Hence the
correction of correlations either for undue homogeneity or
excessive heterogeneity is complicated. The reader may
be referred to discussions by Kelley (1924), Sections 62-64,
and Thomson (1939), p. 172 f.

The main conclusion to be drawn from the above sections
is that the investigator should always scrutinize his
correlational data very carefully, first in order to see whether
any irrelevant common factors may be increasing spuriously
the size of his coefficients, secondly to judge whether or not
the diversity of scores in the group tested is unusually wide
or restricted. He should realize that the absolute size of
a correlation means very little, since this size may be so
much altered by uncontrolled common factors or
heterogeneity. His interpretation of the amount of relationship
between the correlated variables should be made relative
to the particular data with which he is working.

In contrast to the absolute size, the predictive value or
forecasting efficiency of a correlation is but little affected
by heterogeneity. For the latter, as we have seen, depends
on σ√(1 − r²). Hence an increase in r on account of
excessive heterogeneity may be offset by the increase in σ,
and the P.E. remain much the same. It should be noted
further that forecasting efficiency is not in the least upset
by irrelevant common factors. True, it might be foolish
to predict speed of handwriting from the size of boots, yet
the correlation obtained between these variables would
provide a valid means of making such predictions.


COMBINING TWO OR MORE TESTS


The educational statistician is frequently faced with the
problem of combining the results of several examinations
and tests, and using the total scores for the selection of the
best pupils. We have already dealt rather fully with the
point that the distributions to be combined should be
expressed in equivalent units, i.e. that they should possess
the same means and S.D.s, except when it is desired to
give more weight to one set of marks than to another, in
which case the former should have the larger S.D., not
necessarily the larger mean. Here we must draw attention
to an important fact which markers seldom realize, namely
that the S.D. or dispersion of combined variables is always
less than that of the separate variables by an amount
depending on the lowness of correlation between these
variables. This fact is a natural corollary of the facts of
regression, outlined above.

Suppose, for example, that total marks are based on the
average of six examinations, each with a mean mark of 60,
a range of 30 to 90, and a S.D. of 10. If the average
correlation between the six variables is + 0.70, the final
distribution will probably have a S.D. of 8.66 and the average
totals will range from 34 to 86. But if the average
inter-correlation is only + 0.30, the final distribution will have
a S.D. of 6.45, and the range will drop from 30 to 90 down to
41 to 79. It is easy to see why this must be so. When the
average inter-correlation is moderate or low, no pupil or
student will get very high marks, or very low ones, on all
the examinations. Hence the highest and lowest average
marks will be distinctly nearer to the mean than the highest
and lowest marks on each separate examination. The
same phenomenon occurs when marks on different questions
in a single examination are totalled. For instance, if five
questions are marked, each with a range of 5 to 20 out of 20,
the examination totals will probably range from about
40 to 85, and not from 25 to 100.

A chief examiner, head teacher, or other marking
authority, should make allowance for these facts. If he
wishes the distribution of final totals to show a certain
proportion of very high (over 80%) marks, and a certain
proportion of low (under 40%) marks, then he must either
direct the markers of the component examinations to
employ extra-wide distributions (i.e. to award much larger
proportions of marks over 80% and under 40%), or else he
must re-scale the final marks so as to spread them out again
into a distribution with the required dispersion. The
formula connecting σ, the S.D. of the component
distributions, and σav, the S.D. of these distributions averaged, is:

$$\sigma_{av} = \sigma\sqrt{\frac{1 + (n - 1)R}{n}}$$

where R is the average correlation between all the sets of
marks, and n the number of sets.¹
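
The shrinkage of the S.D. can be checked numerically. The sketch below is illustrative (function names are ours); it evaluates the formula for six examinations with S.D. 10, and also turns the formula around to recover the average inter-correlation from the separate and averaged S.D.s.

```python
import math


def sd_of_average(sd, n, avg_r):
    """S.D. of the average of n sets of marks, each with S.D. sd and mean inter-correlation avg_r."""
    return sd * math.sqrt((1.0 + (n - 1) * avg_r) / n)


def average_intercorrelation(sd, n, sd_av):
    """The same formula turned around: mean inter-correlation from the two S.D.s."""
    return (n * (sd_av / sd) ** 2 - 1.0) / (n - 1)


print(round(sd_of_average(10, 6, 0.70), 2))   # 8.66
print(round(sd_of_average(10, 6, 0.30), 2))   # 6.45
print(round(average_intercorrelation(10, 6, sd_of_average(10, 6, 0.70)), 2))   # 0.7
```

Only when the average inter-correlation is 1.0 does the averaged distribution keep the full S.D. of its components.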
1 Alternatively, σsum/n may be substituted for σav, where σsum is
the S.D. of the total scores. The same formula, turned around, is
often useful for finding the average inter-correlation of n variables
when their separate and combined S.D.s are known. Cf. Kelley
(1924), Formula No. 171.

Multiple Correlation.—If two tests or examinations each
correlate moderately with a third, then the two combined
will usually correlate higher with the third than either
separately. For example, English and arithmetic
examinations at 11 + usually yield correlations of about + 0.40
with subsequent secondary school achievement. The
combined mark is likely to correlate about + 0.50 with such
achievement. Every educationist has realized and applied
this principle. With the aid of statistical methods we can
formulate it more precisely.

Let there be n tests which give correlations r1A, r2A,
r3A . . . etc., with some measure of achievement, A, the
sum of all these coefficients being ΣrA; and let the
correlations of the tests with one another be r12, r13, r23 . . .,
etc., their sum being Σrt. Then the correlation of A with
the combined tests is:

$$r_{A(1+2+\dots+n)} = \frac{\Sigma r_A}{\sqrt{n + 2\,\Sigma r_t}}$$


Ex. 50.—Three different types of intelligence tests gave
correlations of + 0.483, + 0.456, and + 0.337 with students'
marks in psychology. ΣrA = 1.276. Their inter-correlations
were + 0.507, + 0.468, and + 0.508, Σrt = 1.483. What
is the correlation of the combined tests with psychology
marks?

$$r_{A(1+2+3)} = \frac{1.276}{\sqrt{3 + 2 \times 1.483}} = +\,0.522$$
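
The combined-test formula can be sketched as follows. The function name is hypothetical, and the inter-correlations are taken as 0.507, 0.468 and 0.508, which is an assumed reading of the example's figures chosen so that their sum agrees with the printed answer.

```python
import math


def correlation_with_combined_tests(r_with_criterion, r_between_tests):
    """Correlation of a criterion A with the unweighted sum of n equally scaled tests."""
    n = len(r_with_criterion)
    return sum(r_with_criterion) / math.sqrt(n + 2.0 * sum(r_between_tests))


combined = correlation_with_combined_tests(
    [0.483, 0.456, 0.337],   # each test with psychology marks
    [0.507, 0.468, 0.508],   # the tests with one another (assumed reading)
)
print(round(combined, 3))   # 0.522
```

The combined figure exceeds the best single test (+ 0.483), though not by much, because the tests overlap considerably with one another.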
1 The above method is a simple but crude way of combin- 
Ing tests so as to give better predictions of achievement. 
Naturally some of the tests are superior to others, and so 
should receive more weight in the total score. Further, 
tho ufiost useful tests will be the ones which, as well as corre- 
lating highly with A, do not correlate highly with the other 
tests that are being employed. For obviously, if tests 1 
and 2 are measuring almost precisely the same thing, they 
Will predlet, as it were, the same aspect of A. Whereas 
What we want are tests which predict different aspects, and 
so between them cover the whole of А more thoroughly. 
Thus it may*be better to leave out test 2, and substitute 
another which, while perhaps correlating less well with A, 
has the advantage of being almost independent of test 1. 
The method known as multiple correlation enables the 
educationist or psychologist to choose from a number of 
tests those which will in combination give the highest 
possible correlation with A, and to weight them suitably. 
Multiple regression equations can also be calculated which 
predict an individual’s most probable score on A by com- 
bining his marks on the separate tests in appropriate pro- 
portions. The method is too elaborate to be given here, 
‘but the above discussion will, it is hoped, indicate its object 
and main principles. | 
The vocational psychologist applies an analogous procedure
in selecting the best candidates for a certain occupation;
that is, he gives a series of tests each of which has been
found to correlate moderately with success at the occupation,
and combines them by the multiple correlation method
so as to obtain as accurate predictions as possible.


TEST RELIABILITY


If a test or examination is applied a second time under
similar conditions, and the testees' scores differ widely
from those previously obtained, the test is obviously a
poor one. It is said to be reliable only if the two sets of
scores correlate highly with one another. Further, if
different testers apply and score a test or examination, they
should arrive at the same, or nearly the same, scores.
Repetition of a test may, however, give an unfair picture
of its reliability, since the testees may remember their
previous responses. Many tests are therefore supplied in two
or more parallel forms, so that if a re-test is desired, a new
set of questions may be set which should, nevertheless, yield
much the same results as did the questions of the first test.
When no alternative form is available, a single test is often
split into two equivalent halves. For instance, the scores
on odd- and on even-numbered questions or items may be
totalled separately, and then inter-correlated. But it is
known that test reliability depends on the length of the
test; hence the correlation between one half and the other
half is unduly low, and is usually corrected by the formula:


$$R = \frac{2r}{1 + r}.$$

Here r is the obtained coefficient, and R the
coefficient to be expected had it been possible to compare
the whole of the test with another similar test.
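The split-half procedure can be sketched in Python as follows. The function names and the small table of item marks are invented for illustration; only the odd-even split and the correction R = 2r / (1 + r) come from the text.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def step_up(r):
    """Correct a half-test correlation to full length: R = 2r / (1 + r)."""
    return 2 * r / (1 + r)


def corrected_split_half(item_scores):
    """item_scores: one list of item marks per testee, items in test order."""
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return step_up(pearson_r(odd_totals, even_totals))


# Hypothetical marks (1 = right, 0 = wrong) for five testees on six items.
marks = [
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0, 0],
    [1, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
print(round(corrected_split_half(marks), 3))
```

A real test would of course need far more items and testees than this toy table before the coefficient meant anything.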

In all these instances (namely repetition, application
and scoring by a different tester, parallel form, and
corrected split-half) we generally expect to get an
inter-correlation of at least + 0.90. For if this reliability
coefficient is much lower it would indicate that the scores are
too unstable to be trusted. Standardized educational and
intelligence tests generally reach this level of reliability,
but ordinary examinations often fail to do so, as we shall
see in Chap. XI. Many performance tests, tests of occupational
abilities, and character or temperament tests, are
also often poor in reliability. However, by increasing the
length of a test, or by combining several parallel tests, a
more satisfactory coefficient may be attained.

Knowing the reliability coefficient it is possible to calculate
the reliability, or the limits of variation, of individual
scores, by the formula already given:
$\mathrm{P.E.}_{est.} = 0.6745\,\sigma\sqrt{1 - r^2}$. For example, if parallel intelligence
tests give a correlation of + 0.93, and the S.D. of their
I.Q.s is 16.5, then $\mathrm{P.E.}_{est.} = 0.6745 \times 16.5 \times \sqrt{1 - 0.93^2} = 4.08$.
But if the reliability coefficient is only + 0.70,
$\mathrm{P.E.}_{est.} = 7.94$, that is, almost twice as large. Such a P.E.
is, as we have already seen, an index of the extent to which
scores on the second test may vary when predicted from
scores on the first test. About half the testees will obtain
I.Q.s on the second test within 4 points (if r = + 0.93), or
about 8 points (if r = + 0.70), of their I.Q.s on the first
test. But some may alter by as much as 4½ × P.E.,
i.e. by 18 and 36 points respectively.
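The two cases can be reproduced with a few lines of Python. This is only a sketch: the function name is illustrative, the S.D. is taken as 16.5, and the higher reliability is taken as + 0.93, the value with which the quoted P.E. of 4.08 agrees. The results differ from the quoted figures only in the second decimal place.

```python
import math

PE_FACTOR = 0.6745  # a probable error is 0.6745 of the corresponding standard error


def pe_of_estimate(sd, r):
    """Probable error of a score on one form when predicted from a parallel form."""
    return PE_FACTOR * sd * math.sqrt(1 - r**2)


for r in (0.93, 0.70):
    pe = pe_of_estimate(16.5, r)
    # The text treats about 4.5 probable errors as the practical limit of change.
    print(r, round(pe, 2), round(4.5 * pe))
```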

The application of two parallel forms of a test to the same
individual yields, as we have seen, somewhat different
results. If we possessed a large number of additional
parallel forms and applied them, still other results would
be obtained (quite apart from effects of practice). Such
results for one individual would tend to conform to a
normal distribution, and their grand average would, of
course, be the best possible measure of the individual's
standing on the test, what may be called his true score.
The situation is exactly analogous to that which we discussed
in connection with group differences (Chap. V),
except that there we were concerned with variations among
the means of sample groups, here with variations among an
individual's scores on sample tests. An individual's score
therefore possesses another P.E., which indicates its
liability to deviate from his true score. The formula for
this also involves the reliability of the test, and the S.D.
of the test scores of an unselected group of persons:
$\mathrm{P.E.} = 0.6745\,\sigma\sqrt{1 - r}$. Thus the P.E.s of I.Q.s for tests
with reliability coefficients of + 0.93 and + 0.70 are 2.96
and 6.11 respectively. Note that these P.E.s are smaller
than those quoted above, since $\sqrt{1 - r}$ is necessarily
smaller than $\sqrt{1 - r^2}$. This is to be expected, since the
former P.E. represents the deviation of one sample score
from another sample, whereas the latter represents the
deviation from the mean of all the possible samples.
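A brief Python sketch sets the two probable errors side by side. The function names are illustrative; the S.D. of 16.5 and the reliabilities of + 0.93 and + 0.70 are the values with which the quoted P.E.s agree to within rounding.

```python
import math

PE_FACTOR = 0.6745


def pe_of_estimate(sd, r):
    """P.E. of one obtained score as predicted from another obtained score."""
    return PE_FACTOR * sd * math.sqrt(1 - r**2)


def pe_of_score(sd, r):
    """P.E. of an obtained score about the testee's true score."""
    return PE_FACTOR * sd * math.sqrt(1 - r)


for r in (0.93, 0.70):
    print(r, round(pe_of_score(16.5, r), 2), round(pe_of_estimate(16.5, r), 2))
```

Since 1 - r² = (1 - r)(1 + r), the first P.E. is always the second multiplied by √(1 + r), which is why it is the larger for any positive reliability.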
Factors Influencing Test Reliability.—Since the reliability
of a test is almost synonymous with its thoroughness,
it can be increased or lowered almost indefinitely by
lengthening or shortening the test. The formula for predicting
the reliability of a test whose length is doubled has
already been cited. When it is lengthened n times, the
formula becomes:

$$R = \frac{nr}{1 + (n - 1)r}$$

This is known as the Spearman-Brown prophecy formula.


Ex. 51.—A short educational test has a reliability coefficient
of + 0.60. How much must it be lengthened to attain a
coefficient of + 0.90? Turning the above formula round, we
get:

$$n = \frac{R(1 - r)}{r(1 - R)} = \frac{0.90 \times 0.40}{0.60 \times 0.10} = 6.$$

It should be six times as long.
The same calculations would apply to the following example.
If two examiners mark a set of examination scripts, and their
marks inter-correlate + 0.60, how many examiners should mark
the scripts for their combined marking to attain a reasonable
reliability of + 0.90? The answer is six.
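Both directions of the Spearman-Brown calculation can be sketched in Python. The function names are illustrative; the formulas are the prophecy formula for a test lengthened n times and the same formula turned round to give n.

```python
def prophecy(r, n):
    """Reliability expected when a test of reliability r is made n times as long."""
    return n * r / (1 + (n - 1) * r)


def lengthening_needed(r, target):
    """Factor by which a test of reliability r must be lengthened to reach target."""
    if not (0 < r < 1 and 0 < target < 1):
        raise ValueError("both coefficients must lie strictly between 0 and 1")
    return target * (1 - r) / (r * (1 - target))


n = lengthening_needed(0.60, 0.90)   # Ex. 51, and the two-examiner example
print(round(n, 2), round(prophecy(0.60, 6), 2))
```

Setting n = 2 in `prophecy` gives back the split-half correction 2r / (1 + r), and n below 1 describes a shortened test.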

The formula assumes that the correlations between the
additional tests (or examiners) would be the same as between
the first pair of tests (or examiners). This condition
seems generally to be satisfied by educational measurements.

A second important factor is the heterogeneity of the
group whose scores provide the reliability coefficient.
Parallel forms of an intelligence test will inter-correlate
much more highly if applied to school-children of several
years' age range than when applied to a single class or to
a single year-group (i.e. children whose ages range over
only one year). Coefficients obtained from the latter
groups are conventionally accepted as fairer indices of
reliability than coefficients from the former. If we know
σ, the S.D. of test scores in a group which is unduly
homogeneous or heterogeneous, then it is possible to predict R,
the reliability to be expected in a group with S.D. Σ, by
Kelley's formula:

$$R = 1 - \frac{\sigma^2}{\Sigma^2}(1 - r).$$

Ex. 52.—The reliability coefficient of an intelligence test
when applied to a class of primary school-children was + 0.88.
The S.D. of their I.Q.s was 14. What would be its reliability
in a secondary school class whose I.Q.s have a S.D. of only 9?

$$R = 1 - \frac{14^2}{9^2} \times 0.12 = +0.71.$$

It can easily be shown algebraically that the P.E. of a
score is not upset by this factor as is the reliability
coefficient. In the above example the two P.E.s would be
$0.6745 \times 14 \times \sqrt{0.12}$ and $0.6745 \times 9 \times \sqrt{0.29}$, both of which
= 3.27. Nor is the P.E. of the score on one test as
estimated from the score on another much affected (this
point was mentioned on p. 141). In our example,
$0.6745\,\sigma\sqrt{1 - r^2}$ = 4.48 and 4.28 respectively.
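Ex. 52 and the remark about the P.E. can be verified together. The Python sketch below uses illustrative function names; the formula is the one quoted above, with `sd_obtained` standing for the S.D. of the group in which r was found and `sd_target` for the S.D. of the group to which it is transferred.

```python
import math


def kelley_reliability(r, sd_obtained, sd_target):
    """Reliability expected in a group of different spread: R = 1 - (s^2 / S^2)(1 - r)."""
    return 1 - (sd_obtained**2 / sd_target**2) * (1 - r)


def pe_of_score(sd, r):
    """P.E. of an obtained score about the true score."""
    return 0.6745 * sd * math.sqrt(1 - r)


big_r = kelley_reliability(0.88, 14, 9)   # primary class -> secondary class
print(round(big_r, 2))
print(round(pe_of_score(14, 0.88), 2), round(pe_of_score(9, big_r), 2))
```

The two P.E.s agree because the formula makes Σ²(1 - R) equal to σ²(1 - r), which is exactly the quantity under the square root.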

Function Fluctuation.—Most reliability coefficients fall
below 1.0 for two distinct reasons: first through the test
being insufficiently thorough (i.e. liable to errors of sampling),
and secondly because the persons tested or examined
may alter in their level of achievement between one test
and another. The latter factor has been termed ‘individual
variance,’ or ‘function fluctuation’ (Thouless, 1936).
Naturally it has a greater effect on coefficients obtained
from parallel tests given on different occasions, or from
repeated tests, than it does on coefficients obtained by the
split-half method. Hence the difference between these two
types of coefficient affords a means of distinguishing
between the reliability of the test itself, and the tendency
of testees to fluctuate in respect of the ability tested.
Thouless provides a formula for estimating function fluctuation
from this difference.

Effects of Test Reliability on Test Inter-correlations.—A
test which is not perfectly reliable may be regarded as
measuring so much of the function which a perfect test
would measure, and so much error. The error part of it is
quite useless, and will have the effect of reducing the
correlation between this test and any other test. If both
tests possess considerable errors through lack of reliability,
their inter-correlation will be considerably reduced. In
fact, the maximum possible value of such an inter-correlation
will be, not 1.0, but the square root of the product of
their reliability coefficients. Suppose we find a correlation
of + 0.50 between special place examinations and subsequent
secondary school achievement, and that the reliabilities
of the examinations and the measure of achievement
are only + 0.70 and + 0.80 (a common state of affairs).
Then the true correlation, if it were possible to obtain
perfectly reliable exams and measures of achievement,
would be

$$\frac{+0.50}{\sqrt{0.70 \times 0.80}} = 0.67.$$
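The calculation can be written out in Python as a minimal sketch. The function names are illustrative; the figures are those of the special place example.

```python
import math


def ceiling_r(rel_x, rel_y):
    """Largest correlation obtainable between two measures of given reliability."""
    return math.sqrt(rel_x * rel_y)


def corrected_for_attenuation(r_xy, rel_x, rel_y):
    """Correlation to be expected if both measures were perfectly reliable."""
    return r_xy / math.sqrt(rel_x * rel_y)


print(round(ceiling_r(0.70, 0.80), 2))
print(round(corrected_for_attenuation(0.50, 0.70, 0.80), 2))
```

An obtained coefficient above the ceiling would make the corrected value exceed 1.0, which in practice points to sampling error in one of the three coefficients.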


Spearman has called this reduction effect of imperfect
reliability ‘attenuation,’ and has provided formulae by
which it may be corrected or compensated. Such correction
is, of course, of theoretical interest, since it tells us
what correlations to expect between various psychological
abilities if they could be accurately measured. But it is of
little practical value, since completely accurate tests are
unattainable, and for all purposes of prediction we have to
employ the obtained, uncorrected, correlation coefficients
(cf. Thouless, 1939a).


CHAPTER VIII
ANALYSIS OF ABILITIES


Fallacious Views of Mental Organization.—Correlational
investigations have assumed especial importance in recent
years, since they enable us to clarify our conceptions of
human abilities, and of the organization or structure of the
mind. The layman's notions as to what abilities exist, and
what each ability includes or excludes, are extremely loose,
and the same was true of psychological theory throughout
most of last century. Teachers and parents are heard to
remark: “Johnny is poor at book-learning, but he makes
up for it by his cleverness with his hands. He will be a
good mechanic when he grows up.” “Mary has an
excellent memory and good power of attention.” “Willie
is slow but sure,” and so on. Such statements are now
known, as a result of experimental investigation, to be
largely fallacious. Only exact study can tell us how far
the slow person is likely to be sure, whether manual ability
in a boy has any predictive value for mechanical ability in
a man, whether there is any such entity in the mind as
memory. Again, the so-called faculty school of psychologists
of a hundred years ago considered the mind to be
made up of a number of distinct powers or faculties such as
reasoning, judgment, imagination, etc. Indeed, it was at
one time regarded as the main object of education to train
each of these faculties by providing children with appropriate
‘mental gymnastics.’ As soon, however, as scientific
psychologists tried to define the faculties precisely and
measure them, and to determine the effects upon them of
various types of training, the faculties fell into discredit. A priori
analysis has not even been able to decide what is the true
nature of intelligence. The accounts of it given by
different writers have been extremely varied and discrepant.
Nowadays, therefore, such problems, both of theory and
practice, are approached by means of experimental studies
of correlations between tests. For they all eventually
resolve into the question: what correlates with what?
The Scientific Definition of An Ability.—First we must
define what we mean by an ability, capacity, or faculty.
It implies the existence of a group or category of performances
which correlate highly with one another, and
which are relatively distinct from (i.e. give low correlations
with) other performances. Take, for example, mechanical
ability. Some people are better than others at tasks
involving manipulation of mechanisms, and this ability is
fairly consistent or general in the sense that those who are
good at one such task are also usually good at others. The
consistency is not perfect. A may be especially good with
meccano, less clever with locks, B the opposite, C the best
with electrical fittings, and so on. But if there was no
appreciable correlation between these and other similar
performances, if they were all found to be specific, we
should not be entitled to regard mechanical ability as a real
entity. Another important condition is that the performances
should be fairly reliable. We do not expect
perfect stability, but if people fluctuated wildly in their
success at mechanical tasks from day to day, we should hardly
recognize the existence of mechanical ability. There is a
further possibility, namely, that the various performances
may inter-correlate well, but that the correlations may be
accounted for by some other common factor such as general
intelligence (the age factor, we will assume, has already
been eliminated). Mechanical ability would not then be
anything distinctive. However, intelligence tests can be
applied to the same persons who take the mechanical tests,
and intelligence can be partialed out or held constant. It
is then found that the mechanical tests still overlap, or
show positive correlations with one another over and above
their correlations due to the intelligence factor (cf. Cox,
1934). We are, therefore, able to accept mechanical
ability as something consistent and distinctive. Tests of
it do correlate sufficiently well and reliably for us to
postulate some common element running through them.
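Partialing out intelligence from the correlation between two mechanical tests can be sketched with the standard first-order partial correlation formula. The function name and the three coefficients below are hypothetical, invented only to show the kind of result described; they are not Cox's figures.

```python
import math


def partial_r(r_12, r_13, r_23):
    """Correlation between tests 1 and 2 with variable 3 held constant."""
    return (r_12 - r_13 * r_23) / math.sqrt((1 - r_13**2) * (1 - r_23**2))


# Hypothetical: tests 1 and 2 are mechanical tests, variable 3 is intelligence.
residual = partial_r(0.50, 0.30, 0.40)
print(round(residual, 3))

# If intelligence alone explained the overlap, r_12 would equal r_13 * r_23
# and nothing would be left over.
print(round(partial_r(0.12, 0.30, 0.40), 3))
```

A residual clearly above zero, as in the first call, is the sort of evidence that justifies speaking of a distinct mechanical factor.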

A common element such as this is often referred to as a
group factor, since it occurs in a group of performances of a
certain restricted type. It differs from a general factor like
intelligence, which is found to run through an extremely 
wide range of tests, which indeed enters to some extent into 
all abilities. 

Since the consistency or overlapping of different mech- 
anical tests is not perfect, no one test can give a really 
adequate measurement of mechanical ability. But by 
combining several tests, we are likely to get a result much 
more representative of the group factor as a whole. The 
same applies, of course, in the measurement of educational 
abilities. We do not expect long-division sums alone to 
tell us how good pupils are at arithmetic, although they
provide some indication of the ability. Instead we set
several types of sums, and try to cover the whole field of 
arithmetic which the pupils are supposed to know. 

Arithmetic is likely to yield a group factor similar to the 
mechanical ability factor. But many of the faculties 
assumed by psychologists in the past, or by laymen at the
present time, fail to stand up to the criteria enumerated
above. Memory, for example, refers to so many different
things, that it is meaningless to talk of so-and-so as having
a good or bad memory in general, or of ‘training children's
memories.’ The speeds with which people learn various
sorts of material, and their retentiveness (i.e. the amounts
of such material which they can recall after an interval),
give only moderate or low inter-correlations, especially
when the influence of intelligence is removed. Probably
there exist several small group factors, each representing
memory for a certain narrow range of material.¹ In other
words, there may be several different types, but no single
entity, of memory.

¹ A summary of the evidence on this and other types of ability,
obtained between 1940 and 1950, is given in the writer's book, The
Structure of Human Abilities (London: Methuen, 1950).

The notion that mental speed differs from, and is usually 
opposed to, mental accuracy or power, also appears to be 
fallacious. If there were two such distinct abilities, then 
tests of speed should give high correlations with one 
another, and low or negative correlations with tests of 
accuracy or power. In actual fact speed and power tests 
all inter-correlate to nearly the same extent, showing that 
they are both components of one and the same mental 
factor. It follows, then, that those who are ‘slow’ are more
likely to be ‘unsure’ than ‘sure,’ at least so far as mental
work is concerned.

All Mental Abilities are Positively Inter-correlated.—Let
us now turn to the wider problem: what abilities does the
mind contain, and how are they organized or inter-related?
A first, extremely important, fact is that all tests of mental
abilities tend to give positive inter-correlations. Even
tests of manual, physical, and other non-intellectual
functions also usually correlate positively with one
another and with mental tests, though the coefficients are 
often very small, whereas the coefficients among tests of 
intellectual functions are generally moderate to high. 
Occasional negative correlations may arise, but they are
not likely to be statistically significant, and so may be 
ascribed to errors of sampling. In other words, talent and 
versatility, more often than not, go together ; a person 
outstandingly good or bad in one field is usually above or 
below average, respectively, in all other fields. This fact 
at once serves to cast a great deal of doubt on the ‘ com- 
pensation theory,’ i.e. the view that those who are poor at 
one thing make up for it by being good at something else. 
It does not entirely disprove it, for the following reasons.

First the reader should study the scattergram for a very
low positive correlation, such as that shown in Fig. 26.
This compares the marks of students in an examination in
hygiene with their intelligence-test scores, and yields an
r of + 0.146. It will be seen that, though there is still a
slight tendency for high scores on $X_1$ to be associated with
high scores on $X_2$, yet the correlation admits of many big
exceptions. Students who are among the highest 8% for
intelligence may range right down to the lowest 18% for
hygiene. The large amount of regression means that a
fair proportion of persons can be high on one variable, low
on the other. Such a state of affairs is certainly much
more likely to occur here than it is when the correlation is
high (as in Fig. 28).

[Fig. 26.—Scattergram and regression lines illustrating a low
correlation, r = + 0.146. Axes: Hygiene Marks ($X_1$) and
Intelligence Test ($X_2$).]

Now ‘book-learning’ and ‘being clever with the hands’
are not highly correlated, hence it is quite possible for some
children to be much better at one than at the other. But 
it is definitely false to suppose that they are inversely 
related, that those who are bad at one are therefore good 
at the other. On the contrary, those who are superior in 
intellectual abilities are on the average superior in practical 
ones also, though to a lesser extent. 

A second reason is that age differences often obscure the
issue. Some teachers are incredulous when told that children
of superior intelligence tend to be superior also in
growth and physical capacities to dull children, since they
know well that the dullards in an ordinary school class are
usually larger and stronger than the bright youngsters.
But, then, they are forgetting that the dullards are far older
than the bright ones, and so may be physically superior.
If the dull children were compared with bright ones of the
same age, the physical superiority of the latter would at
once be obvious. Nevertheless, in this instance also the
correlations are so low that a great many exceptions are 
possible. 

Thirdly, ability is obviously to some extent a matter of 
interest and practice. Pupils who are backward at intel- 
lectual studies and who are on the average backward, but 
much less so, in physical and manual pursuits, naturally 
devote more time and energy to such pursuits. Bright 
pupils, though initially stronger and better at practical 
matters, may be more interested in intellectual matters,
and so fall behind in other things. It is possible, therefore,
for the differences that exist between abilities which give
low inter-correlations to become accentuated, though there
is still no evidence known to the writer that the correlations
become negative.

We see, then, that all the abilities with which
teachers are concerned are, to a greater or lesser extent,
positively correlated. As an example, we will quote some
of Burt's (1917) results. He gave thirteen objective tests
of achievement in various school subjects to 120 children
aged 11 +, and calculated the inter-correlations. The
following table shows the average correlation between each
test and the other twelve tests.


Composition . . . . . . . . . .  + 0.505
History . . . . . . . . . . . .  + 0.448
Geography . . . . . . . . . . .  + 0.456
Nature Study  . . . . . . . . .  + 0.458
Mechanical Arithmetic . . . . .  + 0.288
Arithmetic Problems . . . . . .  + 0.457
Reading Fluency . . . . . . . .  + 0.323
Reading Comprehension . . . . .  + 0.369
Writing Speed . . . . . . . . .  + 0.323
Writing Quality . . . . . . . .  + 0.274
Dictation . . . . . . . . . . .  + 0.325
Drawing . . . . . . . . . . . .  + 0.275
Handwork  . . . . . . . . . . .  + 0.320


Composition, history, geography, nature study, and 
arithmetic problems overlap with the other tests to the 
largest extent. Mechanical arithmetic, drawing, and writ- 
ing quality correlate to the smallest extent. This shows 
that there is a tendency for a pupil's achievements in all 
subjects to run on a fairly even level, though the correla- 
tions are low enough for some pupils to exhibit considerable 
divergences—special strengths in some subjects, special 
weaknesses in others. During the secondary school stage, 
correlations between subjects decrease, partly, of course,
because the pupils form a more homogeneous group (cf.
pp. 140-1), but also because abilities seem to differentiate
during adolescence, and specialized talents develop.
Nevertheless, the positive overlap of all intellectual abilities
continues even beyond the university level. For example,
the present writer inter-correlated the marks of 300
graduate students at a training college, and obtained small
positive coefficients between subjects as widely different
as teaching skill, arithmetic, hygiene, education, psychology,
speech training, singing, physical training, etc. (cf.
Vernon, 1939).

¹ The average reliability of the tests was + 0.738. All these
coefficients would, therefore, be larger by roughly 35% if the tests had
been perfectly reliable (cf. p. 149). A recent repetition of the
investigation with a much larger number of children has yielded closely
similar results (Burt, 1939a).

These facts are of considerable importance to examiners. 
They indicate that a School Certificate or Leaving Certificate,
achieved at the age of 15 to 18, is some criterion of
aptitude for any intellectual occupation, or for a university 
career, albeit not а very good one. Similarly, when the 
Civil Service requires persons to fill a wide variety of posts, 
it selects them, not by trying to test the aptitudes needed 
for each post, but by a general examination based largely 
on subjects studied at school or university. It believes 
that the possession of ‘brains’ for one subject implies
aptitude for all other subjects. And although, as we shall 
see later, the efficiency of these methods of selection leaves 
much to be desired, yet they are to this extent sound. 

The General Factor Responsible for Positive Overlap.—
Now when we find positive association, as in Burt's investigation,
we are justified in assuming the existence of a
general factor, general educational ability, which runs
through all the separate subjects. Some subjects may be
said to be more ‘saturated’ with it than others, since they
are more highly correlated with it. In the primary school
we should get an excellent indication of this general factor
from pupils' marks in composition and in arithmetic
problems, a much poorer indication of it from drawing or
handwriting quality. Statistical methods are available for
assessing a pupil's standing on this general factor from his
marks on all the component subjects (cf. p. 148). Further,
it is quite easy to partial out the factor, and so to see
whether or not it accounts for all the correlations between
the separate subjects, whether, in other words, it is the
only common element involved. When this is done, it is
found that there are still a number of correlations over
and above those due to the general factor. In Burt's
study, the residual correlations indicated considerable
overlapping within the following four groups of tests: (1)
arithmetic problems and mechanical arithmetic; (2) handwork,
drawing, writing speed and quality; (3) dictation,
reading comprehension and speed; (4) composition, history,
geography, and nature study. But there was little or
no overlap, or even negative correlations, between the tests
in one of these groups and those in another group. Such
results show clearly the existence of group factors, i.e. of
distinctive types of educational abilities, over and above
general ability. In this instance there is an arithmetical
factor, a manual one, a linguistic one, and one including all
the higher, more integrative, school subjects. As Burt
points out, we have here a basis for cross-classification of
pupils. He suggests that they should be assigned to school
classes in accordance with their achievements in the
linguistic and ‘integrative’ subjects, and that they might
advantageously be reclassified for arithmetic lessons, and
again for manual subjects.

Analogous findings were obtained in the writer’s study 
of training college marks. Three separate ‘families’ of
subjects could be distinguished, a practical one (teaching
skill, speech training, and physical training), a scientific
one (psychology, hygiene, arithmetic), and a literary one
(English, speech training, education, history). Few researches
along these lines have been carried out in the
industrial sphere, owing to the difficulties of measuring and 
correlating different vocational abilities. But we might 
expect to get from them similar results, for example, group 
factors for the engineering type of occupation, for the sales- 
manship type, the clerical type, and so on. 


FACTORIAL ANALYSIS 


The statistical investigations of Spearman (1927) and
others have shown that it is possible to account for practically
the whole of a set of test inter-correlations by
postulating appropriate common factors.¹ It is legitimate,
then, to regard each test as made up of two components, or
as measuring two things. In so far as it correlates with
other tests, it is measuring some factor or factors. This
is often referred to as the test's communality. But in so
far as it fails to correlate (and most test inter-correlations,
be it remembered, are only moderately high), it is measuring
a purely specific component, or something which is
peculiar to that test alone and has no relationship to any
other test. This is called the test's specificity. It includes,
of course, the error which is present in the test on
account of imperfect reliability² (cf. p. 149). Thus any
test of an educational or vocational ability can be analysed
into certain proportions of certain factors. For example,
in Burt's research, the arithmetic problems test can be
analysed into so much general educational ability, so much
arithmetical group factor (these together make up its
communality), and thirdly, a component specific to that
particular test. The mechanical arithmetic test can be
analysed into a rather smaller proportion of general factor,
a large proportion of the arithmetic group factor, and a
separate specific component. The presence of these
specific components reduces the correlation between the
two tests, but it is still quite high, namely + 0.76, because
they have both a general and a group factor in common.
A test of another subject, such as handwork, still correlates
positively with arithmetic problems, because they are both
partially saturated with the general factor, but the
coefficient is a small one, namely + 0.38, since handwork
embodies a different (manual) group factor, as well as a
different specific component.

¹ Note that we do not claim to be able to account for the whole
of the correlations, for such correlations are, of course, always liable
to errors of sampling. Often the residual coefficients, which are left
after a general factor and some group factors have been extracted,
are so small relative to their P.E.s, that it is not worth while analysing
them further. Even with the most accurate techniques, such as
Hotelling's and Burt's, the later factors are generally so small that
little significance can be attached to them.

² Thomson (1939) points out, however, that some of the influences
producing unreliability, e.g. those of the function fluctuation type,
may be common to several tests, and so contribute to their
communality.

By means of factor analysis, then, it is possible to reduce
a large number of test results to a few underlying common
factors. Any psychological test, or measure of a scholastic
or occupational ability, can be regarded as made up of
some of these factors, or as possessing a certain ‘factor
pattern.’ Further, an individual person's standing on each
factor can be derived from his test scores, and he, too,
possesses a certain factor pattern. Thus, a pupil who is high
in general educational factor will be above average in most
of his school work, but his relative goodness at different
subjects will depend on the strength or weakness of his
group and specific factors.
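The idea of a factor pattern can be made concrete with a small Python sketch. It assumes, as a simplification, that the factors are uncorrelated with one another and that all scores are in standard measure, so that the correlation between two tests is the sum of the products of their saturations on the factors they share. The saturations below are invented for illustration and are not Burt's values; they were merely chosen so that the first pair comes out at the + 0.76 quoted above.

```python
def communality(pattern):
    """Share of a test's variance due to common factors (sum of squared saturations)."""
    return sum(a * a for a in pattern.values())


def specificity(pattern):
    """Remaining share: the specific component together with error."""
    return 1 - communality(pattern)


def implied_correlation(pattern_a, pattern_b):
    """Correlation implied between two tests by the factors they have in common."""
    shared = set(pattern_a) & set(pattern_b)
    return sum(pattern_a[f] * pattern_b[f] for f in shared)


# Hypothetical factor patterns (saturations), invented for this sketch.
arithmetic_problems = {"general": 0.8, "arithmetical": 0.4}
mechanical_arithmetic = {"general": 0.6, "arithmetical": 0.7}
handwork = {"general": 0.5, "manual": 0.6}

print(round(implied_correlation(arithmetic_problems, mechanical_arithmetic), 2))
print(round(implied_correlation(arithmetic_problems, handwork), 2))
print(round(communality(arithmetic_problems), 2), round(specificity(arithmetic_problems), 2))
```

The two arithmetic tests share both a general and a group factor and so correlate highly, whereas handwork overlaps with arithmetic problems through the general factor alone.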

Dangers of Factor Analysis.—At this point a strong 
warning should be given against carrying the factorial con- 
ception of abilities too far. Factors are not entities in the 
mind whose nature or constitution, and whose strength or 
weakness, are immutably fixed. Some writers do seem to 
assume that factorial analysis is revealing the fundamental 
elements of which human minds are compounded, much 
as chemical analysis reveals the elements from which 
chemical substances are compounded. We should realize
that factors consist primarily of categories for classifying
mental tests and examinations. Their value lies in the
fact that they tell us objectively what correlates or
overlaps with what, and so they take the place of unverified
faculties and other subjective conceptions of human traits
and abilities.

It is obviously illegitimate to compare factors in the
educational field with chemical elements. For these factors,
which are deduced from the correlations between various 
school subjects, must depend largely on how the subjects 
are taught and on the stage which the pupils have reached. 
Different factor patterns will, therefore, be found in different 
school classes, according as the teachers connect up, от 
bring about overlap in their pupils’ minds between, the 
subjects. For example, Oldham (1937-88) found very 
diverse relationships between arithmetic, algebra, and 
geometry among different 12- to 13-year-old school classes, 
which could ђе ascribed to differences in teaching methods 
and in the pupils’ attitudes towards the subjects. Others 
have proved that the factors extracted from a set of mental 
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tests alter progressively with practice at the tests (e.g. 
Anastasi, 1936). 

Like all other statistical constructs, factors will be liable 
to a good deal of variation if they are derived from measures 
of small numbers of persons. But apart from such sampling
errors, they must be regarded as depending to a considerable
extent on the particular group of persons
measured. Thomson (1939) shows that they are markedly 
affected by the heterogeneity or homogeneity of this group 
of persons. In addition, they are dependent upon the par- 
ticular set of tests used, since the factor pattern of a test is 
in essence an expression of the relations between this test 
and all the other tests in the set. Statistical analysis of a 
single test in isolation is incapable of telling us what factors
it contains (though subjective analysis, or inspection of its 
content, may often give useful hints as to its probable 
factor pattern, which can later be verified by studying 
objectively its correlations with other tests). However, 
this limitation may not be very serious, since, as we shall 
see below, at least some of the most important factors 
emerge in almost identical form from a wide variety of 
sets of tests.

Then there is the more subtle difficulty that the same set
of tests can always be factorized in a great many different
ways. Most of these ways will lead to quite illogical
results, so that the investigator has to choose from among
the various possible factorizations the most meaningful
and convenient pattern. Take, for example, the educational
tests applied by Burt. We have so far regarded
these as analysable into general educational ability, four
group factors for related subjects, and separate specific
factors for each test. Now Burt showed, incidentally, that
the general factor was not identical with general intelligence
as estimated by teachers, or as measured by intelligence
tests, though the two overlapped closely. But it
would have been quite legitimate to analyse measures of
intelligence along with the educational measures, and to
partial out intelligence first, so making it the principal
factor underlying achievement. If this had been done,
it is likely that a subsidiary general factor would have
emerged representing, perhaps, the pupils’ application to,
and interest in, their work in general. Intelligence plus
this subsidiary factor would then make up what we have
previously denoted as the general educational ability factor.
The group factors would also need to be somewhat modified,
since most of the fourth one (composition, history,
etc.) might have been absorbed into, or accounted for by,
the intelligence factor.¹
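To make the idea of partialling out intelligence concrete, here is a minimal sketch using the standard first-order partial correlation formula. The three correlations are invented for illustration and are not Burt's figures; the function name is likewise an assumption.

```python
import math

def partial_corr(r_xy, r_xz, r_yz):
    """Correlation between x and y after the influence of z is removed from both."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

# Hypothetical figures: two school subjects correlate 0.60 with each other
# and 0.70 each with an intelligence test.
r_subjects = 0.60
r_first_with_intelligence = 0.70
r_second_with_intelligence = 0.70

residual = partial_corr(r_subjects,
                        r_first_with_intelligence,
                        r_second_with_intelligence)
```

On these assumed figures a smaller but still positive correlation remains between the subjects once intelligence is removed, which is the kind of residue from which a subsidiary general factor might emerge.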

In investigating students' college marks, the writer
obtained analogous results. The inter-correlations could
be accounted for either by a very prominent general factor
and group factors restricted to a few subjects, or else by
three factors, all of about equal prominence, each entering
into several subjects.


Certain other precautions which should be observed in
interpreting the results of factorial analyses will be mentioned
later.


1 In his later study (1939a), Burt describes several methods of
factorizing which do yield numerically different, though logically
consistent, results.

Types of Factor Pattern.—Three main types of factor
pattern recur very frequently in educational and psychological
investigations. They may be represented by the
following diagrams, each of which refers to twelve hypothetical
tests, Nos. 1, 2, . . .


I. BI-FACTOR PATTERN

Test | General |  Group Factors  | Specific
     | Factor  |  A   B   C   D  | Factors
  1  |    x    |  x              |    x
  2  |    x    |  x              |    x
  3  |    x    |  x              |    x
  4  |    x    |      x          |    x
  5  |    x    |      x          |    x
  6  |    x    |      x          |    x
  7  |    x    |          x      |    x
  8  |    x    |          x      |    x
  9  |    x    |          x      |    x
 10  |    x    |              x  |    x
 11  |    x    |              x  |    x
 12  |    x    |              x  |    x


II. MULTIPLE-FACTOR PATTERN

Test | Common Factors A, B, C, D | Specific Factors
[Table body not recoverable: Tests 1 to 12 against common factors A, B, C, D and specific factors.]

III. UNI-FACTOR PATTERN 
In the first, which Holzinger (1937) calls the bi-factor
pattern, there is a general factor running through all the
tests, four different group factors, and specifics. This type
of pattern is usually adequate for representing the results
of tests of abilities, at least among children. For instance,
it applies excellently to Burt's investigation.
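One consequence of a bi-factor pattern can be sketched numerically: if the factors are uncorrelated, the correlation implied between two tests is the sum of the products of their loadings on the factors they share. The three tests and all loadings below are hypothetical.

```python
def implied_corr(test_a, test_b):
    """Correlation implied by uncorrelated common factors:
    the sum of products of loadings on factors shared by both tests."""
    return sum(weight * test_b[name]
               for name, weight in test_a.items() if name in test_b)

# Hypothetical loadings: every test has the general factor and one group factor.
tests = {
    1: {"general": 0.6, "A": 0.5},
    2: {"general": 0.6, "A": 0.5},
    4: {"general": 0.6, "B": 0.5},
}

same_group = implied_corr(tests[1], tests[2])    # general factor plus group factor A
cross_group = implied_corr(tests[1], tests[4])   # general factor only
```

Tests sharing a group factor correlate more highly than tests linked only by the general factor. Specific factors do not enter, since by definition they are shared with no other test.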

The second, multiple-factor, pattern requires three, four,
or more factors, none of them general, but all running
through several tests, to account for the test inter-correlations.
It will be seen that Test 1 involves factors A and C
and a specific; that some of the tests involve three, some
only one factor. This type is usually required where tests
of personality, character, or temperament, are concerned,
since in these fields a general trait, running through all the
tests, would seldom be appropriate. For example, Thurstone
(1931) obtained measures of the interests of a group
of students in a large number of different occupations—
advertising, art, chemistry, etc.—and found that the
interest scores could be classified under four main factors.
These he roughly identified as interest in language, in
science, in business, and in people. Each separate interest
could be made up of appropriate proportions of these four
factors. Other instances have been outlined by the present
writer elsewhere (Vernon, 1938b).

Spearman’s Two-factor Theory.—The third, uni-factor,
pattern has aroused, and still continues to arouse,
a great deal of controversy. Historically it was the first
type to be discovered. Early in the present century
Spearman found that diverse tests of mental abilities
usually gave inter-correlations which could be
accounted for (within the limits of their errors of sampling)
by a single general factor plus specific factors. He
enunciated certain criteria to which the correlations must
conform if this is to be possible, the most important of
which is called the tetrad difference criterion.¹ The general
factor obviously corresponds to what we commonly mean
by general intelligence, but Spearman preferred to symbolize
it by the letter g, the specific factors by letters s1,
s2, . . ., etc., so as to get over the difficulty, mentioned
above, that the term ‘intelligence’ has been so diversely
defined by different psychologists. His g factor, it was
claimed, would be the same whatever the tests used. It
could be measured, for example, by a battery of tests consisting
of Analogies, Completion, Vocabulary, and Directions,
or equally well by a battery consisting of Abstraction,
Mixed Sentences, Opposites, and Reasoning Problems.
Each test would have its own independent s, but in so far
as it overlapped with other tests, it would be measuring g.
This view is widely known as the Two-factor Theory.

1 [...] Pattern—is merely due to the fact that Spearman [...]
specific components as factors, whereas the [...] only to the
communality of the factorized tests.
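The tetrad difference criterion can be illustrated with a short sketch. If every correlation is simply the product of the two tests' g-saturations, each tetrad difference (for example r12 x r34 minus r13 x r24) vanishes; an extra overlap between two tests makes some tetrads depart from zero. The saturations and the size of the extra overlap are invented for the example, and sampling errors are ignored.

```python
from itertools import combinations

def corr_from_g(saturations):
    """Correlation matrix implied by a single general factor: r_ij = g_i * g_j."""
    n = len(saturations)
    return [[1.0 if i == j else saturations[i] * saturations[j]
             for j in range(n)] for i in range(n)]

def tetrad_differences(r):
    """All three tetrad differences for every set of four tests."""
    diffs = []
    for a, b, c, d in combinations(range(len(r)), 4):
        diffs.append(r[a][b] * r[c][d] - r[a][c] * r[b][d])
        diffs.append(r[a][b] * r[c][d] - r[a][d] * r[b][c])
        diffs.append(r[a][c] * r[b][d] - r[a][d] * r[b][c])
    return diffs

g_saturations = [0.9, 0.8, 0.7, 0.6, 0.5]          # hypothetical values
r = corr_from_g(g_saturations)
max_tetrad = max(abs(t) for t in tetrad_differences(r))

# Add a hypothetical extra overlap between the first two tests only.
r_group = [row[:] for row in r]
r_group[0][1] = r_group[1][0] = r[0][1] + 0.15
max_tetrad_group = max(abs(t) for t in tetrad_differences(r_group))
```

In the first matrix the largest tetrad difference is zero apart from rounding; in the second it is clearly not, which is how an overlap over and above g would show itself.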

For a while it was thought that the Two-factor Theory
would apply to all tests of abilities, not only to intellectual 
ones, and that very few additional group factors would be 
needed. Only occasionally would a few tests which were 
very similar in content (e.g. tests of musical aptitudes) be 
found to inter-correlate over and above their correlation 
due to g. 

Different tests may differ considerably in their g-satura- 
tions. Those with the highest saturation seem to involve 
the capacity for seeing relationships, whilst tests which 
involve mainly mechanical or rote memory contain very 
little g. Among school subjects, classics is most dependent 
on g; French, English, history and the like are also highly 
saturated. Manual subjects and music have much more 
prominent s's and relatively little g. Thus we find high 
correlations between classics and English, low correlations
between handwork and music, and moderate correlations 
between classics and handwork, or English and music. 
The Two-factor Theory not only explains these and similar 
facts as to the relations between abilities, but also provides 
the mental tester with a scientific basis for choosing tests 
which will give the best measures of g. Although every
sub-test in a battery of intelligence tests will involve a
certain s-factor, tests can be selected in which these s's
are small, and when the tests are combined the s's will tend
to cancel out, so leaving an almost pure measure of g.

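The way in which the s's tend to cancel out can be sketched as follows, assuming standardized sub-tests whose specific factors are uncorrelated with g and with one another. Under those assumptions the correlation of the summed battery with g depends only on the g-saturations. The saturation of 0.7 and the battery of eight sub-tests are hypothetical.

```python
import math

def composite_g_saturation(saturations):
    """Correlation between g and the sum of standardized sub-tests,
    assuming uncorrelated specific factors."""
    total = sum(saturations)
    specific_variance = sum(1 - g * g for g in saturations)
    return total / math.sqrt(total * total + specific_variance)

single_test = composite_g_saturation([0.7])        # one sub-test alone
battery = composite_g_saturation([0.7] * 8)        # eight similar sub-tests
```

On these assumptions a single sub-test correlates 0.7 with g, while the battery of eight rises above 0.9, since the g components accumulate faster than the specific ones.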
Criticisms of the Two-factor Theory.—The theory has
been subjected to much criticism, both on psychological 
and on statistical grounds. Spearman suggested that g 
might correspond to the conception of ‘general mental 
energy.’ Whether or not this is a psychologically sound 
explanation need not concern us here. On the statistical 
side, Thomson (1935, 1939) and others pointed out that, 
although the analysis into general and specific factors is 
legitimate when the tetrad difference criterion is satisfied, 
yet this is not the only possible mode of analysis ; that if 
mental abilities consisted of large numbers of overlapping 
group factors, the same criterion would still be satisfied. 
Thomson’s view would seem to the present writer to be 
preferable from the psychological standpoint, but Spear- 
man’s view is perhaps more practically convenient, since 
we do already in everyday life assume the existence of 
something like g—whether we call it ‘brains,’ ‘brightness,’
‘intelligence,’ or what not—and the Two-factor Theory
provides a scientific basis for this common-sense assumption. 
The theory, and the experiments which it has stimulated, 
at one time seemed to discredit all the mental faculties 
which had so often been abused in traditional psychology 
and in common parlance. But more recent research with 
modern factorial methods indicates that the theory requires
considerable modifications in the matter of group factors. 
The evidence is still very far from complete, but it suggests 
that some at least of the faculties may re-emerge as 
scientifically established group or multiple factors. For
example, Thurstone (1938) applied a wide variety of psycho- 
logical tests to university students, and obtained results 
which could be resolved into a set of multiple factors. 
These he identified as number facility, word fluency, 
visualization, memory, reasoning, perceptual speed, and 
induction. Spearman (1939), however, showed that a 
pattern of general + group factors would fit the same 
results as well, or better. Again, Guilford (1936) found
that the sub-tests of a group test designed to measure adult
general intelligence could be analysed into three multiple 
factors: verbal ability, numerical ability, and ability to 
use simple relationships. 

It seems fairly safe to conclude now that no test measures
g and a specific factor alone, since the type of test
material employed always introduces some additional
common element. All verbal intelligence tests, together
with other tests depending on manipulations of words, involve
a verbal factor, which has been named v. Certainly
some, and possibly all, the common non-verbal performance
tests bring in a practical factor, which Alexander
(1935) calls F. In order to escape the influence of v, many
mental testers are constructing tests out of abstract
diagrams, pictures, and other non-verbal material. Experiments
suggest, however, the existence of a factor, called
k, which enters into such tests involving spatial relationships
(cf. El Koussy, 1935). This overlaps, or may be
identical with, F. School examinations and educational
achievement tests depend (as Burt's research showed) on a
factor which possibly represents the pupils’ interest in and
attitudes to work, or their character and temperament
qualities. In addition there are probably group factors for
families of school subjects, largely governed, as we have
seen, by teaching methods. The main types (literary or
humanistic, scientific and mathematical, practical and
manual) are to be found not only among the abilities of
school-children, but also among those of adults such as
student teachers (cf. Vernon, 1939). In the vocational
field we may expect to discover a similar differentiation,
though the only group factors which have been at all fully
investigated so far are those for mechanical ability (m-factor),
and for routine manual operations (cf. Cox, 1934).
In later adolescence and adulthood, however, the situation
becomes very obscure. A g-factor can still be extracted
from mental tests, but it seems to play a decreasingly important
part both in educational and in vocational achievement.
Abilities become so diversified, and interests, work
attitudes, and temperament traits are so influential, that
we can hardly hope to establish any simple scheme of
mental organization.

Educational and Vocational Applications.—Such educational
and vocational findings are of great practical importance,
for if the Two-factor Theory were exclusively
accepted, educational and vocational guidance by means
of tests would scarcely be possible. If we wished to advise
a child as to the type of education or the occupation for
which he would be most suited, we could apply tests of g
and so deduce the general intellectual level of the work
he should take up. But we could not decide whether he
would be likely to do better at classics or science, at a
clerical job or an engineering trade, and so on, except by
giving a vast number of tests of different s factors, or by
letting him try each type of work in turn so as to find by
practical experience which specific one suited him. There
could be no recommending of general lines of occupation.
To a certain extent this does seem to be true. Educational
and vocational guidance among adolescents are extremely
difficult and uncertain. They have to be based more upon
a subjective analysis of interests and character traits than
upon objective predictions by tests. Yet common experience
strongly suggests that there do exist general categories
or types of work, and that therefore a number of
group factors, additional to m, could be established by
further research. If this were done, guidance would
become much more scientific. Full consideration of a
candidate’s interests and character would, of course, still
be required, but his scores on a series of tests for each of
these factors would give objective indications of his probable
ability along each of the main lines of work.¹ Some
overlap might also be found between the group factors in
school work and those in different types of occupation, so
that outstandingly good mathematicians, linguists, etc.,
might be directed to occupations which involve these
abilities.

1 Thomson (1939) has recently pointed out that when tests are
given with the object of predicting educational or vocational achievement,
the extraction of factors from the tests may constitute quite
an unnecessary intermediate step. A much more efficient method
would be to correlate the tests directly with measures of achievement
and then to use multiple correlation for obtaining the most accurate
predictions. While this view is, of course, perfectly true from the
statistical standpoint, [...] not in guidance. It should certainly be
used by the tester who wishes to select the best candidates for some
single and definite educational course or job. But the problems of
the vocational adviser are much too complex to be tackled as yet by
this exact method. He is forced to predict a candidate's probable
achievement in generalized terms, or factors. [...] from subjective
considerations of [...] tests and lines of [...]. The view put forward
above is that his procedure could be made more systematic and
scientific if more factors were experimentally established.

Conclusions about Factor Analysis.—Clearly an enormous
amount of unavoidably complicated and expensive research
will be needed before the whole realm of human abilities
can, as it were, be mapped out. The difficulties are enhanced
because, as we have seen, the factor patterns which
an investigator extracts are often dependent on the particular
tests and particular testees he employs, and because
there is much room for subjective judgment in the type of
factor pattern he chooses. Research at present is rather
badly co-ordinated, since each investigator sets out to explore
fresh territory, instead of trying to build on the factors
already fairly well established by previous workers.
Further, there is always the danger of mathematical rather
than psychological considerations getting the upper hand.
As already mentioned, factorial investigations are not revealing
the elements out of which the mind is compounded,
but are classifying test performances with a view to predicting
more effectively other performances, e.g. in the
educational or vocational sphere. But by no means all of
the important facets of human nature are susceptible yet
of accurate measurement. There may be abilities such as
executive capacity, intuitive power, goodness of teaching,
and social intelligence, which are generally recognized in
daily life, but which are not yet properly substantiated or
analysed owing to the inadequacies of our tests. Still
more is this true in the realm of emotional traits and interests,
for reasons which the writer has described elsewhere
(Vernon, 1938b). Yet we cannot hope to make successful
predictions about an individual child or adult unless
we know every side of his make-up. It may well be that
all these attempts at analysing the measurable abilities and
traits of human beings will never give us a picture of an
individual personality considered as a whole, that we shall
always need to supplement test results with subjective
judgment and intuition in order to resynthesize them into
a living entity. But this, too, is a matter of psychological
controversy which the reader may follow up elsewhere
(e.g. Vernon, 1937a, b; Cattell, 1937).
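Thomson's suggestion of predicting achievement directly by multiple correlation, mentioned in the footnote on guidance, can be sketched for the simplest case of two tests and one criterion. The formulas are the standard ones for two standardized predictors; the three correlations and the function name are invented for illustration.

```python
import math

def two_predictor_regression(r1y, r2y, r12):
    """Standardized regression weights and multiple correlation
    for predicting a criterion y from two tests."""
    denom = 1 - r12 ** 2
    b1 = (r1y - r2y * r12) / denom
    b2 = (r2y - r1y * r12) / denom
    multiple_r = math.sqrt(b1 * r1y + b2 * r2y)
    return b1, b2, multiple_r

# Hypothetical figures: the tests correlate 0.5 and 0.4 with later
# achievement and 0.3 with each other.
b1, b2, multiple_r = two_predictor_regression(0.5, 0.4, 0.3)
```

With these assumed figures the two tests together predict the criterion somewhat better than the better test alone, without any factors being extracted along the way.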

We have no space to describe the actual statistical 
techniques of factor analysis. Spearman has fully de- 
scribed his method in The Abilities of Man (1927). It is 
not difficult to apply, though rather laborious. It is the 
most accurate method of extracting unitary-factor patterns,
but it is hardly suitable for bi- and multiple-factor patterns. 
Holzinger’s (1937) method is an extension which deals 
readily with bi-factor patterns, though it does not seem to 
possess much advantage over the simple technique employed 
by Burt in 1917, and systematized by him in a recent article 
(1938). For multiple factor analysis, Thurstone’s technique
(which is well described by Guilford, 1936) is the easiest 
to acquire, but is less accurate than the techniques de- 
veloped by Hotelling, Kelley, and Burt. The aims and the 
guiding principles of these latter methods are somewhat
different from those dealt with in the present chapter. 
Since they are hardly comprehensible to any but advanced 
statisticians they have been omitted here. But a clear
account of them is available in Thomson’s book (1939).


CHAPTER IX 
MENTAL TESTS 


MENTAL testing is simply a more refined and scientific
method of doing something which we all of us do every day
of our lives, that is, assessing one another's abilities and
character traits. Our ordinary method is to observe a
person's behaviour, including his facial expressions and
gestures, in various situations and circumstances, and his
speech or written productions. For example, we might
watch a small boy playing with bricks, and from his skilfulness,
or the elaborateness of the resulting construction,
jump to the conclusion that he is quite a bright child.
Obviously the basis for this, and other such judgments, is
very haphazard and unreliable. There is no sure evidence
that the boy is acting more, or less, intelligently than the
majority of children. Nevertheless this performance may
quite well be refined and turned into a scientific mental
test. So-called performance tests of intelligence are based
on just such activities with bricks, blocks, and other concrete
material.

Let us then first consider the essential features of a 
mental test which distinguish it from such everyday judgments
and from the slightly more scientific school
examination.

(1) Standardization of Test Situation.—The first step
which the tester takes is to standardize the test situation.
He will, for example, adopt some standard set of bricks and
give definite instructions, so that all children who are subjected
to the test shall do it under as nearly as possible
identical circumstances. Until this is done it is not legitimate
to compare the performances of different children.
With every good test there is published a detailed set of
instructions, to which the tester must closely adhere.
School and university examiners, on the other hand, usually
give no instructions beyond stating the number of questions
to be answered, and they neglect to inform the examinees
what type of answers they require. Because of this, and
because of frequent ambiguities in the wording of questions,
different examinees may conceive the tasks that are set
them in a variety of different ways, and it becomes exceedingly
difficult to compare and mark their answers.



The material of a test, its conditions of application, and 
its instructions, are further planned to be readily com- 
prehensible to, and suitable to the mental level of, the 
children or adults for whom it is intended. This is another
feature which some school examiners might be advised to 
copy instead of setting subjects for compositions, or other 
questions, which are far above the heads of the pupils. 

We must, however, admit that several widely used tests 
do not entirely live up to these criteria. Some of them 
were originally constructed and standardized in America;
hence they require considerable modification or translation
before they are suitable for British children. We are 
forced to employ these ‘second-hand’ tests because the 
production of fresh tests is an extremely lengthy and ex- 
pensive business. Unfortunately different British testers 
do the necessary ‘re-modelling’ in different ways, so that
the test situation is no longer identical for all. 

(2) Standardization of Scoring.—The mental tester does
not observe the child’s behaviour, or his finished product,
in the casual manner implied by our illustration, but obtains
an accurate record of them. He insists on knowing definitely
what was the test response, i.e. the child's reaction
to the previously laid down test situation. Either the
response is something which can be delimited in quantitative
terms (e.g. so many bricks put into their correct positions
in so many seconds), or else it is something which can
be designated unequivocally as right or wrong.¹ It is
most important that the personal opinion of the tester
should play no part in this record. Any two testers observing
and scoring the response should arrive at identical
records. If this condition holds the test is called an
objective one. All good mental tests, then, are provided
with scoring keys or manuals of instructions which enable
the recording and scoring to be done objectively.

1 So-called Quality Scales are tests whose scoring constitutes an
exception to this rule, cf. p. 179.
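What a scoring key achieves can be shown in a few lines. The key, the item numbers, and the child's responses below are hypothetical; the point of the sketch is only that a fixed rule leaves no room for the scorer's opinion.

```python
def score_with_key(responses, key):
    """Count the responses that match the scoring key exactly."""
    return sum(1 for item, correct in key.items()
               if responses.get(item) == correct)

# Hypothetical four-item key and one child's recorded responses.
KEY = {1: "b", 2: "d", 3: "a", 4: "c"}
child_responses = {1: "b", 2: "a", 3: "a", 4: "c"}

# Two testers applying the same key to the same record.
score_by_tester_a = score_with_key(child_responses, KEY)
score_by_tester_b = score_with_key(child_responses, KEY)
```

Both testers necessarily arrive at the same score, which is the sense in which the test is objective.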

Our everyday judgments of traits or abilities are, by 
contrast, decidedly subjective, for numerous investigations
have demonstrated their liability to vary with the person 
who makes them. Some teachers are still convinced that 
they can assess the intelligence of their pupils better than 
can any test. But the fact is that different teachers,
assessing the same pupils, give such different opinions that 
the scientific psychologist can put little trust in any of 
them. The same is true of many school and other examina- 
tion marks, as will appear in Chap. XI. 

(3) Norms.—The layman commonly seems to regard the
observed behaviour as having some absolute value or
significance. There is no doubt in his mind that it indicates
high intelligence, or poor sociability, or whatever the trait
may be. The more cautious psychologist realizes that such
judgments must be based largely on more or less vague
recollections of the behaviour of other similar children under
similar circumstances, and so insists that all grading of
abilities is essentially a comparative or relative matter (cf.
Chaps. II and IV). Thus one of the most vital accompaniments
of a mental test is the norms, by reference to which
any child’s goodness or poorness of performance may be
assessed. Unfortunately it is also the feature in which
many published tests are defective, on account of the difficulties
already described (pp. 79-87).
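The comparative nature of all grading can be made concrete with a small sketch that refers one raw score to a set of norms. The ten norm scores and the raw score of 16 are invented, and a real norm group would of course be far larger; the two helper functions are assumptions of this example.

```python
from statistics import mean, pstdev

def standard_score(raw, norm_scores):
    """Deviation of a raw score from the norm mean, in standard deviation units."""
    return (raw - mean(norm_scores)) / pstdev(norm_scores)

def percentile_rank(raw, norm_scores):
    """Percentage of the norm group scoring below the raw score,
    counting half of those who tie with it."""
    below = sum(1 for s in norm_scores if s < raw)
    equal = sum(1 for s in norm_scores if s == raw)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)

# Hypothetical norm group of ten children of the same age.
norm_group = [8, 10, 11, 12, 12, 13, 14, 15, 16, 19]

z = standard_score(16, norm_group)
pr = percentile_rank(16, norm_group)
```

A raw score of 16 means nothing by itself; against these norms it lies one standard deviation above the mean, at about the 85th percentile.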

In the nature of examining it is hardly possible to establish
norms for examinations, since they cannot be applied
to large numbers of pupils or students beforehand for this
purpose, as is the standardized test. We have seen, however,
that the process of rationalizing marks, which is
essentially equivalent to the setting up of provisional
norms, is necessary if the present ambiguities in the significance
of school or examination marks are to be avoided.

(4) Reliability and Validity.—A single piece of behaviour, 
casually observed, as in our illustration, would probably 
be very low in reliability. A discussion of the statistical 
methods of determining reliability was given above (p. 145),
and the reliability of several of the more important tests 
is mentioned later in this chapter. 

The most serious defect in the layman's everyday judgments
is that he seldom has any good proof that the
observed behaviour validly indicates the ability which he
deduces from it. It may have arisen from some quite
different trait or ability, or else from an ability which is
almost wholly specific (in Spearman's sense), i.e. not correlated
with anything else. To be reliable a test need only
measure accurately the ability to perform that test, but to
be valid it must also correlate effectively with other
measures of whatever it is supposed to test. The mere inspection
of the content of a test is seldom sufficient for the
determination of its validity, although it is a necessary
preliminary.
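The distinction between reliability and validity can be sketched with the ordinary product-moment correlation. All three sets of scores below are invented: two applications of the same test to six children, and an outside criterion of the ability the test is supposed to measure.

```python
import math
from statistics import mean

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical scores for six children.
first_testing = [12, 15, 9, 20, 17, 11]
second_testing = [13, 14, 10, 19, 18, 11]
criterion = [55, 48, 60, 52, 58, 45]

reliability = pearson_r(first_testing, second_testing)   # test against itself
validity = pearson_r(first_testing, criterion)           # test against criterion
```

In this made-up case the test agrees closely with itself on re-application yet has almost no relation to the criterion: it is reliable without being valid.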

Such inspection of educational tests and examinations
may reveal their failure to cover the ground which the pupils
should have accomplished. Thus school and university
examinations are often defective, because their questions
are too few in number, or are badly set. It will be obvious
also to a tester that the content of some of the published
educational tests is unsatisfactory, either because they are
designed for pupils whose tuition differs widely from that
current among his testees, or because they are out-of-date.
But apart from such flaws, many tests involve group or
specific factors which detract from their effectiveness as
measures of some general ability. For example, a test of
reading may be based on the capacity for pronouncing
difficult words. This may or may not indicate children's
capacities in other aspects of reading, such as their fluency
or their comprehension of what they have read. Investigation
by means of correlations is essential for establishing
these points. Again, the technique of the test, i.e. the way
in which the testee is required to express his ability, always
seems to introduce an irrelevant factor. In Chap. XI we
shall show that in an ordinary essay-type of examination,
the translation of the examinees' knowledge into essay-form
is one such factor, and in Chap. XII the examiner's
translation of his questions into new-type or objective form
will be recognized as another. Temperamental factors,
health, fatigue, and the like also influence a testee's performances
at many tests, and so reduce their validity as
measures of ability.

Most educational and vocational tests are assumed not
only to be valid indicators of some present general ability,
but also to be prognostic of future achievement. Their
predictions should then, of course, be followed up and compared
with subsequently obtained measures of achievement.
In the educational field there have been many such
investigations whose results usually show our tests and
examinations to possess disappointingly poor validity (cf.
Chap. XI). Vocational psychologists have also tried to
correlate their predictive tests with some measure of output,
or with estimates of efficiency obtained from supervisors
or foremen. As Farmer (1933) has shown, however,
it is not always easy to secure a satisfactory criterion of
success or failure in an occupation against which the test
results can be checked.

The validation of intelligence tests, or of tests of capacities
such as practical ability, verbal ability, and the like, is
particularly difficult, since we possess no objective criterion
of intelligence, etc., with which to compare them. Ratings
have sometimes been used, that is, teachers’ estimates of
the intelligence of children in their charge, but these are
highly fallible. If psychologists are unable to agree as to
what they mean by intelligence, it is unlikely that laymen’s
views will be any more concordant. Indeed, if teachers
could judge it accurately, there would be no need for tests.
The main defect of such ratings is that they are strongly
biased (often quite unwittingly) by considerations such as
the children's industriousness and good school behaviour.
Parents' judgments are still less trustworthy. Several
group intelligence tests have been validated by comparing
their results with the results of one of the Binet scales.
But the validity of these scales is open to doubt. The items
which they contain have always been chosen so as to
differentiate older (hence presumably more intelligent)
children from younger (less intelligent) ones. This criterion
is, however, dubious in value, since it would admit
tests of, say, height and weight, which have very poor
validity as indicators of intelligence. Yet we do not rely
merely on subjective judgment when we state that they
must be excluded from a test of intelligence, for in the
absence of any external criterion of validity, we make
use of what has been called the method of internal
consistency.
This method consists either of factorial analysis, or of some
modification thereof. All the test items or sub-tests are
selected in the first place on the basis of subjective opinion;
they must appear to the tester to involve the intellect.
Thereafter they are studied empirically. Either each item is
compared with every other, or with the sum of all the rest of
the items, or sets of items such as the sub-tests in a group
intelligence test are compared with other sets. The
inter-correlations will serve to show whether or not all the
items are consistent with one another, whether they are
measuring one and the same thing. Any items that fail to
conform are eliminated from the test.

The methods used by Spearman in his factorial analysis of
group intelligence tests are much more rigorous than those
applied to the Binet scale. Both, however, [...] through all
their test items or sub-tests. Spearman and his followers wish
to measure nothing but g, on a two-factor pattern, and so
exclude a number of tests which fail to conform to such
patterns. The Binet scale, on the other hand, although it
certainly embodies g, contains a hotch-potch of heterogeneous
items which would, if fully analysed, yield several group
factors (a bi-factor pattern), or even a set of multiple
factors (cf. Burt, 1939b). The effects of these different
techniques of test construction will be discussed more fully
below. What matters for our present purpose is that when
a test, or a set of tests, is internally consistent, this signifies
that it will validly predict any other behaviour which
involves the same general factor. Intelligence tests work
fairly effectively because scholastic success, and many of
the other intellectual activities of everyday life, are known
to be dependent largely on the g which they measure.
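
The simplest form of the method of internal consistency, in which
each item is compared with the sum of all the rest, can be sketched
as follows. This is an illustration only and not a factorial
analysis: the pass and fail marks are invented, and the cut-off of
0.2 below which an item is rejected is an arbitrary assumption, as
are the function names.

```python
from math import sqrt


def correlation(xs, ys):
    """Product-moment correlation; 0.0 if either list does not vary."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def item_rest_correlations(scores):
    """Correlate each item with the sum of all the other items.

    `scores` holds one row per testee and one column per item.
    """
    n_items = len(scores[0])
    result = []
    for i in range(n_items):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]
        result.append(correlation(item, rest))
    return result


def consistent_items(scores, minimum=0.2):
    """Indices of items whose item-rest correlation reaches `minimum`."""
    correlations = item_rest_correlations(scores)
    return [i for i, r in enumerate(correlations) if r >= minimum]


# Hypothetical marks (1 = pass, 0 = fail) for six testees on four items.
SCORES = [
    [1, 1, 1, 0],
    [1, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

print([round(r, 2) for r in item_rest_correlations(SCORES)])
print("items retained:", consistent_items(SCORES))
```

In these invented data the fourth item correlates negatively with
the rest and is therefore the one rejected. In practice the
rejection would be repeated after each item is dropped, since the
sum of the remaining items changes.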
Test Construction.—Before proceeding, we would strongly
urge amateur testers and teachers not to attempt to make
up their own tests, apart from new-type ones for school
examination purposes. To devise test items similar to
those which appear in published tests is not difficult, but a
great many technical features enter into the selection of
suitable items and into their standardization, in addition
to those described here. The points which we have outlined
are meant, not to show how to construct a test, but
to enable amateurs to choose more wisely the tests which
will be most suitable for their purposes, and to give them
some assistance in understanding what they are doing when
they apply published tests.


TYPES OF TESTS


At the end of this chapter is given a classified list of most
of the mental tests available in this country. The main
headings are:

  I. Attainment, Achievement, or Educational Tests.
       A. Examinations.
       B. Standardized educational tests.
            1. Quality scales.
            2-4. Individual tests.
            5-6. Group tests.
 II. Individual Intelligence Tests.
       A. Versions of the Binet-Simon scale.
       B-E. Miscellaneous scales.
       F. Performance tests.
III. Group Intelligence Tests.
       A. Verbal.
       B. Non-verbal.
       C. Oral.
 IV. Special Aptitude Tests.
  V. Sensory-motor Tests.
 VI. Temperament, Personality, and Character Tests.

The last three sections consist of little more than references,
but the first three aim to be comprehensive.
Although we shall not attempt to describe particular tests,
the following general explanations of, and comments on,
the main types may be useful to those who are not already
thoroughly familiar with the field.

Examinations and Educational Tests. Examinations are
still the most important method of measuring attainment,
and so come at the head of our list. Their defects,
and the possibilities of improving them, or of employing
substitutes, are discussed in Chaps. XI-XIV. Standardized
educational tests must be contrasted with examinations;
although they are made up of the same kinds of
questions as occur in objective or new-type examinations,
their functions are quite different. They are not designed
to cover the work done by particular classes of pupils or
students, nor for the selection of those who are likely
to ‘make good’ in some future work. They show
what is the standing of a class, or of an individual,
relative to other similar classes or individuals. [...]
of, or stages within, a subject (the latter being known as
diagnostic,¹ or analytic, tests). The fact that they are
published, and their expense, make them unsuitable
for use as school examinations. By means of them,
however, a teacher can usefully grade his pupils at the
beginning of the school year, so as to find which ones will
need most coaching, or in which parts of a subject they are
most advanced or backward. If the norms are adequate
(cf. Chap. IV), they will tell him how his pupils compare
with pupils in general of the same age. The psychologist
at a Child Guidance Clinic, or the vocational adviser, finds
them especially valuable for assessing a child's strong and
weak points.

¹ Cf. the excellent discussion of diagnostic tests by Schonell
(1935, 1937).

Quality Scales.—Although most educational tests consist
of large numbers of short items, both so as to cover the
subject as thoroughly as possible, and so as to admit of
objective scoring (cf. Chap. XII), some educational products
which are too complex to be assessed in this piecemeal
fashion can nevertheless be measured by ‘quality’ or
‘product’ scales. Such standardized scales are available
for grading handwriting, composition, drawing, etc. Each
of these consists of a series of specimen products which have
been carefully chosen as representative of certain levels of
achievement. The testee is told to write or draw the same
matter in his normal manner, and his product is, as it were,
slid along the scale, or compared with the specimens, until
one is found to which it corresponds in achievement. It
is then given the score of this specimen. Such grading
cannot be made wholly objective, but there is evidence that
it improves with practice. Different testers will, when
experienced, award the same or nearly the same score to a
child's product. Certainly the process is fairer than grading
without any external standards. Hence the adoption
of a similar plan in the marking of school work was
advocated in Chap. II.

Burt's scales all consist of specimens selected from a large
number as being typical of the work of [...] 5-, 6-,
. . . 14+ children. Cattell's handwriting scale for
school-leavers (13 to 15+), and his drawing scale for adults,
include specimens typical of the best 6%, the next best 25%,
the middle 38%, the next 25%, and the poorest 6% of such
persons. The scores assigned to these (I, II, III, IV, and
V) represent equivalently spaced units (cf. p. 77).
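
Why proportions of 6, 25, 38, 25, and 6 per cent should give
equivalently spaced units can be checked numerically. The sketch
below assumes that the quality being graded is normally distributed
among such persons (an assumption made here for illustration, not a
statement taken from the scale's handbook) and finds the
standard-score boundaries between the five grades.

```python
from statistics import NormalDist

# Proportions of persons in the five grades, poorest to best.
PROPORTIONS = [0.06, 0.25, 0.38, 0.25, 0.06]


def grade_boundaries(proportions):
    """Standard-score boundaries between successive grades."""
    unit_normal = NormalDist()  # mean 0, standard deviation 1
    boundaries = []
    cumulative = 0.0
    for p in proportions[:-1]:  # the top grade has no upper boundary
        cumulative += p
        boundaries.append(unit_normal.inv_cdf(cumulative))
    return boundaries


boundaries = grade_boundaries(PROPORTIONS)
# Widths of the three inner grades, in standard deviations.
widths = [upper - lower for lower, upper in zip(boundaries, boundaries[1:])]
print([round(z, 2) for z in boundaries])
print([round(w, 2) for w in widths])
```

On this assumption each of the three inner grades is close to one
standard deviation wide, so that the five grades can fairly be
treated as equal steps.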

Quantitative Tests and Scales.—The scoring of quantitative
tests, either by the ‘rate’ or ‘power’ systems, or by
a mixture of these, has already been described in Chap. I.
It will be noted that scarcely any tests are available much
above the 14-year level, and that very few deal with subjects
outside the 3 R's. One obvious reason for this is
that the subjects taught in post-primary schools are of
comparatively little importance to the majority of the
population. Another is that the matter included under history,
languages, mathematics, science, etc., varies so much from
one school to another, that it is scarcely possible to establish
general standards of attainment. Nevertheless it
would not be difficult to devise and standardize tests suitable
for some fairly homogeneous groups of pupils or
students, such as those in English public schools, in
provincial universities, in Scottish secondary schools, and so
on. Certainly our tests lag far behind those available for
similar purposes in America.


INTELLIGENCE TESTS 


Intelligence tests closely resemble quantitative educational
tests (to which they are, of course, historically prior)
in their construction and scoring. Thus they may consist
of:

(a) A series of tasks graded in difficulty (power plan).

(b) Tasks such as performance tests where the speed or
number of moves are recorded (rate plan).

(c) Sets of questions which are roughly graded, but
where the score depends on the number answered within a
certain time limit.

Group tests generally include a ‘battery’ of some half-dozen
(between about four and twelve) ‘sub-tests,’ each
containing some twenty (between ten and fifty) questions
or items. Alternatively they are arranged in ‘omnibus’
form, where items of all kinds are mixed up. In the battery
type, each sub-test has its own instructions and sample
items, and is timed separately. In the omnibus form, the
instructions and samples may be printed along with the
questions, or included in a preliminary practice sheet, and
one time limit suffices for the whole test. An omnibus test
is therefore somewhat easier to give, but a test battery has
the advantage of providing rest pauses every few minutes.
Almost all intelligence tests thus consist of rather large
numbers of items. Performance tests are exceptional, but
it is desirable to apply several, say, six or more, of these.
The tests are always applied and scored in a standard
manner, and the scores either take the form of Mental Age
years (as in the Binet and some other tests), or can be
translated into M.A.s and I.Q.s by means of tables of
norms.
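
The step from a Mental Age to an I.Q. is simple arithmetic. The
sketch below assumes the usual ratio definition, I.Q. = 100 x M.A. /
C.A., with both ages written as years and months; the function names
are invented, and no allowance is made for the special treatment of
chronological age that test handbooks prescribe for older
adolescents and adults.

```python
def to_months(age):
    """Convert an age written as 'years:months' (e.g. '9:6') to months."""
    years, months = age.split(":")
    return int(years) * 12 + int(months)


def ratio_iq(mental_age, chronological_age):
    """Ratio I.Q. = 100 x M.A. / C.A., as a whole number.

    Python's round() sends exact halves to the nearest even number.
    """
    return round(100 * to_months(mental_age) / to_months(chronological_age))


# Hypothetical children.
print(ratio_iq("10:0", "8:0"))   # M.A. ahead of C.A.
print(ratio_iq("9:0", "9:0"))    # M.A. equal to C.A.
print(ratio_iq("7:6", "10:0"))   # M.A. behind C.A.
```

A child whose Mental Age equals his chronological age thus obtains
an I.Q. of 100, whatever that age may be.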
The Innateness of Intelligence.—The main difference
between intelligence and educational tests is that the
former aim to include tasks whose solution depends upon
native ability rather than upon acquired experience.
Although the precise definition of intelligence is still a
matter of controversy—useful discussions are given by
Ballard (1922) and Knight (1933)—most writers agree that
some general innate capacity may be regarded as underlying
all our abilities. Some regard this capacity as constant
and unalterable throughout life, but recent evidence
suggests that, at least in early years, it may be greatly
stimulated by a favourable environment, or depressed by
an unfavourable upbringing (cf. Wellman, 1938). All,
however, would admit that it remains fairly constant under
normal circumstances which provide neither unusual
stimulation nor inhibition. Apart from the
problem of the stability of intelligence itself, we have to
recognize that we can never get at it directly; it shows
itself only through the medium of acquired knowledge
or skills. Therefore the tasks included in an intelligence
test should as far as possible involve only those
acquirements which are likely to be equally available to
all persons. For then if A scores well and B badly, we
can claim that the difference is not due to A's having had
more opportunity or better training than B, but to the
underlying power which has enabled A to make better use
of his opportunities and training. This conclusion fits in
well with that reached in the previous chapter, namely
that all measures of intelligence involve a group factor
dependent upon the medium used in testing, as well as
upon g.

Language is a fairly suitable medium from this point of
view, and has been the most widely used of any. But
verbal tests are clearly unfair to those who have not had
normal opportunities for acquiring language, such as
foreign-born or bilingual children, the deaf, and the children
of gipsies or canal bargees who receive little or no schooling.
Further, it is probable that there are innate differences in
verbal capacities which therefore further distort the scores
obtained in intelligence tests dependent on manipulations
of words. Many testers therefore prefer non-verbal tests,
where the media consist of pictures, abstract diagrams, or
concrete (performance) material. These, of course, are
open to analogous defects. A British or American child
has much more opportunity than an African child for
acquiring skills with pictures, blocks, etc., and so has a
decided advantage when he is made to express his intelligence
through tests which involve such material. Nadel
(1939) shows convincingly the absurdity of assuming that
racial differences in intelligence can be measured by such
tests.

Tests of the concrete or pictorial type are valuable for
application to young children, who are not so accustomed
as older persons are to verbal thinking, and are not interested
in verbal tasks. Non-verbal material may be essential
also in testing the deaf or illiterate, and is useful as a
supplement to predominantly verbal tests. But since
one of the most important, if not the most important, use
of intelligence tests is that of predicting scholastic aptitude,
and school work is itself predominantly verbal, verbal tests
are certainly the most useful. Though we know of no
conclusive evidence, yet it seems very likely that verbal
tests will correlate much better than pictorial or abstract
ones with school work. The latter, however, may have
more value in scientific and technical education.

Versions of the Stanford-Binet Scale.—Up to 1937 the
majority of testers used the Stanford Revision of the
Binet-Simon scale, published by Terman in 1916, or Burt's
unpublished restandardization of this Revision. But now
the most popular version is the New Stanford Revision,
published in 1937 by Terman and Merrill. This version
is superior to any other both because it contains very full
instructions for application and scoring, which should
obviate the confusions and errors previously so common
among inexperienced testers, and because of its greater
extensiveness. Both the L and M forms cover almost the
whole range of intelligence from that of 1½-year-old children
to that of adults.


Now the original American form of the New Revision
cannot possibly be applied as it stands in this country,
because it is replete with words and phrases which require
‘translation’ into English. Most of them have been
altered in the edition published here, but some still remain.
Since it would be unwise for every tester to make his or her
own additional modifications, further revisions are being
drawn up by psychologists. A representative committee
under Professor Burt has prepared a complete revision of
Form L, which often differs very widely from the present
British edition. A committee of the Scottish Council for
Research in Education has prepared a list of minimum
alterations for Forms L and M, which can be obtained on
request, and copied by testers into their handbooks.
None of these three British versions is at present
standardized.¹ Probably some of the tests are distinctly easier,
others more difficult, for British than for American children
of the same age.² But the final scores (M.A.s or I.Q.s)
yielded by the test are likely to be nearly correct, since
previous research with the 1916 Revision has shown that
the Stanford-Binet norms for America, Scotland, and
England are practically identical.³ There is, however,
still some doubt as to the adequacy of the tests for higher
Mental Ages (12 years upwards). In the earlier versions
these tests were too difficult, and high I.Q.s among older
children and adults were generally not high enough (cf.
Vernon, 1937a). In the New Revision the opposite is
found; that is, the high I.Q.s tend to be too high. But
there is not yet sufficient evidence to enable us to specify
the extent of this error. Until a proper restandardization
is carried out, the tester should be very cautious at these
higher levels. He must, in addition, expect to find:

(a) That many of the tests are misplaced. A child may
fail all the tests in one year, but yet pass one or two in a
higher year. Hence thorough testing will necessarily take
rather a long time.

(b) That the items within any one test are often in wrong
order of difficulty. The Vocabulary words, some of the
Absurdities, and other sets of items, require rearrangement.

(c) That the typical responses quoted in the scoring
instructions are of little use as they stand, since, even after
some translation, they represent the thoughts and expressions
of American children. The tester must try to decide
whether his testees’ responses correspond in intellectual
level to these acceptable and unacceptable responses.

One other noteworthy disadvantage of the New Stanford
Revision is the expensiveness of the concrete material
needed for testing young children with Mental Ages of 4 : 0
or less. However, the tester who is concerned only with
children of chronological age 6 or more can dispense with
it. The only concrete material needed in later tests is the
beads, blocks, and printed cards, and these can be obtained
separately.

¹ Cf. Terman-Merrill Intelligence Scale in Scotland, by D.
Kennedy-Fraser (London: University of London Press, 1945).

² The Vocabulary test in particular seems to be much too easy,
though this was not true of the same test in the 1916 Revision.
Preliminary research by the present writer has indicated that Terman
and Merrill's norms need to be adjusted as follows:

Mental Age level        6    8   10   12   14   15:4   17:4   19:10   22:10
                                                A.A.   S.A.I  S.A.II  S.A.III
Vocab. words:
  Terman and Merrill    5    8   11   14   16    20     23     26      30
  Adjusted norms        5    9   13   17   21    24     29     36      42

Burt (1939b) mentions several other instances of serious misplacement,
and finds that nearly half the tests are a year or more too high
or too low.

³ Cf. Scottish Council for Research in Education (1933), and Burt
(1935).
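
The practical effect of the Vocabulary adjustment noted in the
footnote above can be illustrated by looking up the highest level
at which a given number of words earns a pass. The sketch takes the
footnote's two rows of word counts to read 5, 8, 11, 14, 16, 20, 23,
26, 30 and 5, 9, 13, 17, 21, 24, 29, 36, 42 against the nine Mental
Age levels; that reading, the name `ADJUSTED`, and the rule that a
level is passed when the required number of words is reached are
assumptions made for illustration.

```python
# Mental Age levels of the Vocabulary test, in years or 'years:months'.
LEVELS = ["6", "8", "10", "12", "14", "15:4", "17:4", "19:10", "22:10"]

# Words required for a pass at each level (assumed reading of the rows).
TERMAN_MERRILL = [5, 8, 11, 14, 16, 20, 23, 26, 30]
ADJUSTED = [5, 9, 13, 17, 21, 24, 29, 36, 42]


def highest_level_passed(words_known, norms):
    """Highest Mental Age level whose word requirement is met, or None."""
    passed = None
    for level, required in zip(LEVELS, norms):
        if words_known >= required:
            passed = level
    return passed


# A hypothetical testee who defines 17 words correctly.
print(highest_level_passed(17, TERMAN_MERRILL))
print(highest_level_passed(17, ADJUSTED))
```

For this hypothetical testee the original standards give a pass at
the 14-year level and the adjusted ones only at the 12-year level,
which shows how far an over-easy test can inflate the higher Mental
Ages.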

Performance Tests.—As we have already seen, these concrete
tests may be employed when verbal ones are inapplicable,
or may usefully supplement the predominantly
verbal Stanford-Binet test. Their chief merits are their
greater attractiveness to children, and their capacity for
bringing out a variety of temperamental reactions. A
testee’s manner of approach to such tests, and his behaviour
when confronted with difficulties, throw much light on his
impulsiveness, persistence, complacency, and other qualities,
which cannot as yet readily be tested or measured
objectively.

The majority of performance tests suffer from the practical
drawbacks that they are very unwieldy and difficult to
transport, and extremely costly. Some testers are suspicious
of them because different tests often give very
inconsistent results. A child with a Binet M.A. of 9 years
may, for instance, pass performance tests at anywhere from
the 7- to the 12-year levels. One reason for this is their
poor standardization. The published norms may be inadequate
(cf. Vernon, 1937a), or the American standards
may be incorrect for this country. Often also the test
material or method of application differ in certain small
respects from those current when the test was standardized,
so that the test is thereby rendered somewhat easier or
more difficult. As performance tests have to be applied
individually, the task of providing accurate British age
norms has as yet hardly been attempted. Even if they
were perfectly standardized, we should still expect considerable
inconsistencies, since their inter-correlations show
that each test involves a large specific factor, or unknown
group factors. Most of them are also poor in reliability;
and the evaluation of this feature is difficult, because they
can hardly be applied twice to the same testees, except
after an interval which is long enough to wipe out memories
of the earlier testing.

Undoubtedly then their validity, either as measures of
g or of some practical group factor such as Alexander’s F,
is poor. The conception of ‘practical ability’ in everyday
life is even more obscure than the conception of general
intelligence. Quite frankly, we cannot as yet say what
type of ability in daily life they serve to measure. It is
interesting to note that several of them are markedly
affected by emotional maladjustment or neuroticism.
Children or adults tested at a Psychological Clinic tend to
make poorer scores on them than on the Binet test. This
dependence on affective group factors constitutes another
reason why they are not very efficient measures of ability.
Thus the mental tester should never place much reliance
on a single performance test, nor on two or three. A small
number may be sufficient to indicate whether a Binet test
result is likely to have been distorted by some verbal defect
in the child, but the median or mean score from at least
half a dozen of them, or the total result from a comprehensive
battery such as Drever and Collins's, should be
taken if a reliable performance test M.A. is desired.
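
The advice to take the median of at least half a dozen performance
tests can be sketched as follows. The Mental Ages are invented
(chosen to span the 7- to 12-year spread mentioned earlier), the
function names are hypothetical, and the refusal to report a figure
from fewer than six tests simply encodes the recommendation in the
text.

```python
from statistics import median


def to_months(age):
    """Convert an age written as 'years:months' to months."""
    years, months = age.split(":")
    return int(years) * 12 + int(months)


def to_age(months):
    """Convert months back to the 'years:months' form."""
    years, rest = divmod(months, 12)
    return f"{years}:{rest}"


def performance_mental_age(mental_ages, minimum_tests=6):
    """Median M.A. over several performance tests."""
    if len(mental_ages) < minimum_tests:
        raise ValueError("too few performance tests for a reliable M.A.")
    middle = median([to_months(age) for age in mental_ages])
    return to_age(round(middle))


# Hypothetical M.A.s obtained by one child on six performance tests.
RESULTS = ["7:0", "8:6", "9:0", "9:6", "10:0", "12:0"]
print(performance_mental_age(RESULTS))
```

The median is used here rather than the mean because it is less
disturbed by one wildly discrepant test; the text allows either.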


GROUP TESTS VERSUS INDIVIDUAL TESTS


The merits and the defects both of individual tests such
as Stanford-Binet or performance tests, and of group
verbal or non-verbal tests, may best be brought out by
contrasting them under the following series of headings:

(1) Age Range.—Individual tests must be used with very
young children because it is practically impossible to hold
the attention of a group, or to get them all to act alike at
the same moment. Further, it is not natural for children
to apply their intelligence to symbols on paper—either
words, pictures, or diagrams—until they have settled down
to formal schooling. Manipulation of concrete objects or
individual conversation with the tester constitute much
more appropriate testing media. Group tests are available
which claim to measure Mental Ages down to 6 or even 5
years. The instructions for these must, of course, be given
orally. But their value is very doubtful, for the above
reasons. They may sometimes be useful for classifying
pupils aged 7+, i.e. children whose M.A.s are likely to run
from 5 to 9.

By 9, or better 10, years, children can read instructions
for themselves, are thoroughly accustomed to class work,
and can readily deal with a variety of verbal problems.
This is therefore a very suitable age at which to begin group
testing. Above 14, group tests have definite advantages
over individual, both because they can readily be made
difficult enough even for the highest intelligence levels, and
because their items are much less apt to appear silly and
childish to suspicious adolescents or adults.

(2) Ease of Application.—Group tests are easier to procure,
to apply and to score, and they demand much less
training and experience on the part of the tester. Possibly,
however, these are disadvantages, since so many unskilled
persons misapply and misinterpret them (cf. Chap. X).

(3) Time.—Needless to say, the time saved by applying
group instead of individual tests is enormous. For instance,
a class of forty can usually be tested in less than an
hour, and their answers scored in four to eight hours,
whereas an experienced Binet tester needs about half an
hour for each young child, three-quarters for an older one,
and an inexperienced tester much longer still. The point,
however, is whether this three-quarters of an hour per child
is worth while. The great majority of psychologists would
certainly say that it is, and that an individual test can
reveal much better than a group test things which the child’s
teachers and parents may have failed to recognize even
after years of acquaintanceship. Probably the best compromise
would be for the school psychologist or infants'
mistress to test each child individually at 6, for group tests
to be applied in class at 10, and possibly at 12 and 14, and
for individual tests to be given at these later ages only to
exceptional pupils whose group test results are very
puzzling, or about whom some important decision has to
be made for purposes of educational or vocational guidance.
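
The saving in time can be put in figures from the allowances just
quoted. In the sketch below the group session is counted as a full
hour (the text says less than an hour, so this is an upper limit),
and the function names are invented for illustration.

```python
CLASS_SIZE = 40


def group_testing_hours(testing=1.0, scoring=(4.0, 8.0)):
    """Lowest and highest total hours for one class tested as a group."""
    return (testing + scoring[0], testing + scoring[1])


def individual_testing_hours(class_size=CLASS_SIZE, per_child=(0.5, 0.75)):
    """Lowest and highest total hours for testing each child singly."""
    return (class_size * per_child[0], class_size * per_child[1])


group = group_testing_hours()
individual = individual_testing_hours()
print(f"group test:       {group[0]:.0f} to {group[1]:.0f} hours")
print(f"individual tests: {individual[0]:.0f} to {individual[1]:.0f} hours")
```

For an experienced tester, therefore, individual testing of the
class costs some twenty to thirty hours against five to nine for
the group test, before any allowance is made for inexperience.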

(4) Time Limits.—Most of the items in the Binet scale
and many individual performance or educational tests are
untimed, whereas the great majority of group tests have to
be done at maximum speed. Are not, the critics ask, some
children slower but surer than those who score well on these
timed tests? That the answer is probably in the negative
has already been indicated in the previous chapter. A
large body of research shows that ‘speed’ and ‘power’
factors cannot be effectively distinguished.¹ Certainly if
time allowances were increased the slower testees would
improve their scores, but experiment proves that they
would still remain relatively poorer than the quick testees,
and (according to the principles of Chaps. II and IV) it is
relative, not absolute scores that matter. Untimed group
tests have been constructed, but they are inconvenient
both because the questions have to be made more difficult,
and because different testees finish them at different times.
Possibly also they may introduce the temperamental factor
of persistence or willingness to go on trying, which is
irrelevant in measuring intelligence.

At the same time, it should be remembered that factorial
studies of ‘speed versus power’ are based on correlations
between scores of groups of persons, and that even quite
high coefficients may admit big exceptions in individual
cases (cf. Chap. VII). It is conceivable then that the timing
may unfairly affect a few children or adults, especially
those who are emotionally maladjusted. This is one reason,
if rather a dubious one, why Clinic psychologists prefer to
use individual tests.

¹ A group factor of mental speed has emerged from Holzinger’s
extensive investigations (1934), but it appears to be very small. For
a useful summary of earlier work on the problem, see Spearman
(1927).

(5) Objectivity.—With good group tests the conditions of
application and the scoring are much better standardized
than with most individual tests. Very little is left to the
personal decision of the tester, and the marking is, or should
be, completely foolproof. In actual practice many group
testers still manage to commit gross errors, some of which
will be pointed out in the next chapter. And modern individual
tests like the New Stanford Revision afford much
less scope for subjective whims than did the older ones.
Nevertheless individual testers often have to judge the
correctness of doubtful responses, and may, without departing
from the instructions in the handbook, alter the
testing conditions in a variety of ways. Clearly, then, if
different testers are liable to arrive at different results when
testing the same children, their results are not objective or
reliable. Yet there seems to be no evidence of serious discrepancies
between the results of trained testers, although
untrained ones may make grave mistakes. The reliabilities
of the 1916 Stanford and Burt-Stanford Revisions were
very high so long as the testees were in the middle range
of Mental Ages, and were not suffering from serious emotional
disorders. The New Revision should be even better.
Greater difficulties, and therefore less reliability, occur in
testing children below the age of 5 years, or adolescents of
13+ and adults. Also occasional children at a Psychological
Clinic are so unco-operative or unstable that their
results cannot be trusted. But the experienced tester
should be able to recognize when his judgments are uncertain
and try to obtain a re-test by someone else who
may be able to handle such cases better.

Actually the rigid objectivity of the group test is distrusted
by Clinic psychologists, since they know that the
same physical situation may mean very different things to
different children. The flexibility of individual testing is
to them a great advantage which, so far as is known, they
do not abuse.
(6) Dependence on Health and Mood.¹—Both group and
individual tests seem to be much less affected by these
factors than might be anticipated. Experiments show
little and often no difference between the scores of testees
when they are fatigued or ill, and when in good health, or
between those who are strongly motivated (e.g. by the
promise of monetary rewards) and those who are merely
encouraged in the ordinary way. Studies of the latter type
sometimes show that testees who are making more effort
may attempt more test items, but they also make more
mistakes so that their scores are scarcely altered. There
seems then to be little evidence for such statements as that
the examination conditions of group testing make some
testees unduly nervous, or that others are stimulated to do
better by the competitive nature of the testing situation.
All such conclusions, however, are based on average
tendencies, which admit occasional individual exceptions,
and Binet testing certainly indicates that a few testees are
so resistant or timid that they fail to do themselves justice.
More extreme differences in attitude to the test can undoubtedly
distort the scores,² and certain physical conditions
such as bad eyesight must have a like effect. No
intelligence test possesses perfect reliability; and conditions
such as these may be at least partly responsible.
Test reliability is known to be particularly low among unstable
testees (cf. Tulchin, 1934).

The great advantage of individual over group testing is
that in the former the tester can generally discern the
influence of abnormal conditions, whereas in the latter he
has scarcely any control over, or insight into, them. It
may be that while high scores on group tests are relatively
stable and significant, low scores can arise from a variety
of sources other than poor intelligence. However, a lot
more careful research is needed in order to determine
definitely the part played by such factors.

¹ For a fuller discussion of this, and the two following, sections,
see Cattell (1937) and Vernon (1937b).

² This was found by Barron in an experiment on strict and slack
conditions of supervision in group testing, carried out under the
present writer's direction. The correlations between two forms of
a group test applied under similar conditions averaged [...].
But the same tests given under different conditions only correlated
+ 0·573, showing that the conditions have a marked effect on what
the tests measure.

(7) Dependence upon Acquired Inftuences.—The Stantord- 
Binet test has in the past been especially strongly criticized 
for including tests of information which might depend on 
Schooling, social background, and thelike. Many of these 
tests (e.g. Names of Coins, Days of the Week, etc.) have 
disappeared from the 1937 Revision, and there seems now 
little to choose between individual and group tests in this 
respect. Indeed the new Binet test may be superior in 
that it involves a wider variety of mental operations than 
the manipulations of words which go to make up most 
group tests, and which are very likely to be affected by 
schooling. It possesses the further advantage that the
individual tester can often gauge the incidence of environ- 
mental factors such as poor social background, lack of 
verbal facility, etc., which the group tester cannot do. 
The former cannot, of course, make any scientific correction 
for abnormal upbringing and environment, but he can 
apply additional non-verbal tests, and can discount the 
verbal ones if their results seem unreliable for such reasons. 
As we have already seen, no test measures innate intelli- 
gence directly, and the results of tests only hold good for 
those who have had similar experience, or similar oppor- 
tunities for acquiring facility with the various media— 
words, pictures, or concrete material—through which
intelligence may be expressed.

(8) Dependence on Practice or Coaching.—Most intelli- 
gence tests are published, hence anyone can procure them, 
coach testees on them, and ensure considerably increased 
scores. It is a most foolish thing to do, since a child who
is made to appear unduly intelligent may also legitimately
be expected to be capable of unduly difficult school work.
But whenever examinations for secondary school entrance,
or for other purposes, include intelligence tests, the teachers
who wish their pupils to do well will naturally be tempted.
In Binet testing it is fortunately not difficult to recognize
when children have been coached, since their glib answers
are very different from the answers of children who are
puzzling out the tests for themselves. With group tests, on
the other hand, this is seldom possible unless (as has
actually occurred) a whole class answers 19 test items out
of 20 correctly, the 20th being one where the teacher was
mistaken. In order to prevent coaching, many Education
Authorities obtain unpublished tests from Moray House,
Edinburgh, where they are constructed and standardized
privately.
Apart from coaching on a particular test, evidence is
accumulating of a considerable practice effect among tests
in general. The fact of having recently done Test A tends
to improve scores on Test B, and when Tests A and B are
similarly constructed, coaching on Test A seems to send
up Test B scores still more. Whether, when A and B are
rather dissimilar tests, over-coaching on A may reduce
scores on B, is not yet known. But this seems quite
possible in the light of experiments on transfer. Terman
and Merrill (1937) have found a practice effect of 2 to |
points of I.Q. if Form M of the New Stanford Revision is
taken shortly after Form L, or vice versa. The effect in
group tests may be considerably larger, at least among
testees of superior intelligence (cf. Vernon, 1938a), since
group test questions are generally presented in a more
roundabout form, a form which seems very strange to
naive testees. In the Binet test the questions are asked
naturally and directly, and the testee usually supplies an
answer in his own words. But in group tests the items
are of the ‘closed’ or ‘recognition’ type. The testee has
to select the best of a number of answers, in accordance
with instructions printed at the top of the page. The
sophisticated testee is likely to become familiar with the


* Cf. work by H. Macrae, carried out at Jordanhill College School,
Glasgow, not yet published.

instructions commonly used, and to pick up various more
efficient methods of selecting the right answers quickly.
Much as the cross-word puzzle habitué becomes used to the
‘language’ in which cross-word clues are set, the testee
may become progressively accustomed to the ‘language’
in which mental tests are set.

The magnitude of such practice effects is not yet precisely
known, but it is probably large enough to make the
use of the same norms for naive and for sophisticated testees
unfair. The school psychologist, therefore, had better not
test children more frequently than, say, once in two years,
and should keep careful check of any practice the children
may get on educational, or other similarly constructed,
tests.

(9) Validity.—The Binet test, as already mentioned,
measures an unanalysed hotch-potch of abilities, and it
has been severely criticized by Spearman, Cattell, and
others on account of the weaknesses of its statistical
construction and theoretical foundations. A properly
constructed group test, on the other hand, is claimed to yield
a measure of almost pure g. And yet most psychologists
prefer the Binet test to any group test when they are called
on to make any decision involving the intelligence of an
individual child. They regard it as their most valid
mental measuring instrument, i.e. as giving the best available
index of a child’s ‘general intelligence.’ Few experienced
persons appear to put much trust in an individual’s
group test score, not merely because of prejudice or
conservatism, but also because different group tests are known
to yield remarkably discrepant results. Correlations
between them are usually around + 0·7 to + 0·8, and a
child who is given two or more may obtain I.Q.s differing
by as much as 30 or even 40 points. In the present writer’s
view, group tests are very useful for testing groups, e.g. for
comparing a whole class of children with the norms, or for
large-scale experiments, but not for obtaining trustworthy
information about individuals. What reasons may be
advanced for this paradoxical state of affairs?

It seems probable that the intelligence which we recognize
in daily life, in school work and other activities, and
which we wish to measure or predict by our tests, is never
pure g, but is always admixed with various group factors.
And that very similar impurities occur in the items of the
Binet scale, so that the M.A.s and I.Q.s which it yields
correspond excellently with our conceptions of intelligence.
From the factorist’s point of view this is a most
unsystematic and haphazard state of affairs, but it happens
to work well. In more colloquial terms, Binet and his
followers chose items very largely from children’s everyday
life experiences, regardless of their factor content, and
therefore the test as a whole gives a much better measure
of everyday intelligence than would perfect tests of
unadulterated g. Further it would seem that intelligence in
daily life is not a purely cognitive process, but is closely
bound up with the child’s or adult’s emotions and interests.
Only under favourable affective conditions are its
powers displayed. Now the Binet tester, through his
personal contact with the child, has considerable control over
his state of mind; he can produce favourable conditions,
and can judge, albeit subjectively, whether the child’s
responses are typical of the best that he can do. From
general conversation and from the child’s manner of
approach to difficulties (particularly those offered by
performance tests) he builds up a picture of the child’s
personality, and observes how that personality actually
applies his available intelligence. He therefore gets a
much more complete and practically useful view than he
could from a more scientific measure of cognitive
intellectual ability.

In group tests, on the other hand, the items are mostly
of a much more restricted and artificial type. The child
is forced to express his intelligence through a less natural
medium. All that the tester gets is the numerical score,
with none of the additional indications which are so useful
in interpreting Binet test scores. If such tests do measure
pure g, then the general intelligence which psychologists
and teachers want to measure must be very impure.
Actually the discrepancies between the results of different
tests either show that most of them are not good measures
of g, or suggest that (despite the small effects of mood and
health conditions) there must be many uncontrolled factors
which upset the scores in unknown fashion.
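
The size of disagreement to be expected from correlations of + 0·7 to + 0·8 can be sketched numerically. What follows is only an illustration under stated assumptions, not a calculation taken from the text: it supposes that both tests yield I.Q.s with the same standard deviation (the figure of 15 points is hypothetical), and that the differences between a child's two I.Q.s are normally distributed, in which case their standard deviation is SD × √(2(1 − r)).

```python
import math


def sd_of_difference(sd: float, r: float) -> float:
    """Standard deviation of the difference between two scores that share
    the same standard deviation `sd` and correlate `r` with each other."""
    return sd * math.sqrt(2.0 * (1.0 - r))


def share_differing_by(points: float, sd: float, r: float) -> float:
    """Proportion of testees whose two scores differ by at least `points`,
    assuming the differences are normally distributed about zero."""
    spread = sd_of_difference(sd, r)
    if spread == 0.0:
        # Perfectly correlated tests never disagree.
        return 0.0 if points > 0 else 1.0
    z = points / spread
    # Two-tailed normal probability, P(|Z| >= z).
    return math.erfc(z / math.sqrt(2.0))


if __name__ == "__main__":
    ASSUMED_SD = 15.0  # hypothetical I.Q. standard deviation, not from the text
    for r in (0.7, 0.8):
        print(r,
              round(sd_of_difference(ASSUMED_SD, r), 1),
              round(share_differing_by(30.0, ASSUMED_SD, r), 4))
```

On these assumptions a gap of 30 points would arise only rarely from chance disagreement alone, which would be in keeping with the suggestion that other, uncontrolled factors also upset the scores.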

In the present writer’s view, testing must be done
individually. Yet he would agree with Spearman, Cattell, and
Kent (1937) that the New Stanford Revision is a distressingly
unscientific instrument owing to the heterogeneity of
its content. Burt (1939b) has recently published a preliminary
account of an analysis of its items by modern
factorial methods, which both demonstrates this heterogeneity
and shows that it is responsible for great irregularities
in the order of difficulty of the items among different
testees. Boys and girls, bright and dull pupils, and children
from higher and lower social grades, all tend to give
somewhat different orders, because they possess different
patterns of the group factors upon which the items depend
(in addition to g). A further consequence of this heterogeneity
is that it tempts testers to make dubious subjective
analyses of types of items, as when they note that a child
does especially well on the practical items, or badly on the
memory items, and deduce, without any evidence, that he
is a ‘practical’ child in daily life, or has poor ‘memory’
for school work. Burt’s investigation should help considerably
in clearing up these ambiguities. Nevertheless
the writer believes that an alternative method of testing,
advocated by Kent, would be much preferable. A series
of individual scales should be collected gradually, each of
which should measure some homogeneous ability, but which

1 A personal note of explanation seems to be needed here. In a
recent article (1937b) I put forward the strongest arguments I could
muster against factor theories, and in favour of the Binet test.
Actually I was attempting to express the views of clinic psychologists,
more than my own, which have always been sympathetic to
factor theory. But I must admit that I have been influenced, partly
by Cattell’s reply to my article and partly by Kent’s able criticism
of the Binet test, and therefore appear less favourable towards it in
this book than previously.—P. E. V.

should also be sufficiently different from one another and
varied in character, to cover as wide a range of intellectual
operations as the present Binet and performance tests.
Such homogeneous abilities need not be pure factors in the
statistical sense, nor independent of one another, but their
diagnostic worth could be scientifically verified. The tester
would choose the scales most appropriate to his particular
testing purpose, and to his testee’s age and temperament.
This method would require greater training and skill on
the part of the tester, but it would seem to be by far the
soundest way, both from the theoretical and practical
viewpoints, in which testing might develop.


TESTS AND METHODS OF MEASUREMENT IN
EDUCATIONAL PSYCHOLOGY


The following list includes most of the mental tests available
in this country, together with some particulars of age
limits, times, publishers, prices, and references to literature.
A few may be omitted through pure ignorance on the part
of the writer. But usually tests are excluded either
because no reasonable norms are available, or because they
are likely to be too unreliable or too inaccurate to be worth
using. Apart from a few performance tests, no American
or other foreign tests are included, unless they are also
published in this country. It is not possible to provide full
and uniform information regarding every test, nor can
complete accuracy be guaranteed, although every item has
been checked. Age limits usually refer to the ages for
which norms are provided, not to the ages for which the
tests are suitable. We shall see later (Chap. X) that the
latter are much more restricted than the former. When a
definite figure for the time is given, this usually means the
actual working time. An approximate figure, e.g. c. 20
mins., usually means the total time, including the distribution
of test blanks, reading of instructions, etc. Prices are,
of course, subject to alteration. The prices of tests issued
by American firms are doubtful on account of the exchange,
and are liable to an increase owing to customs duty.¹ The
list does not usually include the price of the test manual
and scoring key, but most publishers provide these,
together with a sample copy of the test blank, for about
1/- per test. References to books or articles where
descriptions of, or norms for, the test may be found are
generally listed after the name of the author of the test.
Additional references and notes are often needed, and these
are put in footnotes.

Publishers’ names are abbreviated as follows: B =
Baird, Edinburgh. E = Experimental Instruments Co.,
Sudbury, Suffolk. G = Gibson, Glasgow. H = Harrap.
K = Bar-Knight Model Engineering Co., Glasgow. L =
Lewis, London. N = National Institute of Industrial
Psychology, London. O = Oliver and Boyd, Edinburgh.
S = Stoelting, Chicago. U = University of London Press.

An asterisk (*) in the prices column indicates that the
answers to this test can be written on blank paper, and the
printed blanks used over again.


¹ Testers are advised to order American tests through their own
educational bookseller. Customs officials will often allow material
so ordered to be ‘educational’ and therefore free of duty, but may
demand duty from a private purchaser.


Note, 1953.—Most of the current prices are (including purchase
tax) approximately double those quoted here. A few of the
tests are now out of print, and a few additional ones have been
published, notably those contained in Diagnostic and Attainment
Testing by F. J. and F. E. Schonell (Edinburgh: Oliver & Boyd,
1950).


IV. SPECIAL APTITUDE TESTS 


Almost the only special aptitude tests standardized for
use in this country, and worth using, are those of the
National Institute of Industrial Psychology, to whom
application should be made. They include the
Mechanical Aptitude and Manual Ability Tests (Cox);
Group Tests of Clerical Aptitude and of Form Relation
Ability; Tests for Motor Drivers, etc. For descriptions of
these and other tests see Oakley and Macrae (1937), and
the files of The Human Factor (Journal of the N.I.I.P.).
For a description of tests of artistic and musical abilities,
see Vernon (1935b).


V. SENSORY-MOTOR TESTS


A. Bodily physique; strength, quickness, and steadiness
of movement; fatigability. The best-known tests include
dynamometers, tapping tests, aiming tests, reaction
time, pursuit-meter, dotting machine, etc., few of which,
however, are adequately standardized. References:
Whipple (1919), Oakley and Macrae (1937), Sutcliffe and
Canham (1937).

B. Right- or left-handedness. Cf. Burt (1937).

C. Vision. 1. Colour blindness. Tests include Holmgren
wools; Edridge-Green lantern; Collins-Drever,
Ishihara, and other figure tests. Cf. Collins (1937).

2. Right or left eye dominance. Cf. Burt (1937).

D. Hearing. 1. Auditory acuity. Tests include whispered
speech, audiometer, etc. Cf. Whipple (1919), B (1937).

2. Sense of pitch, intensity, rhythm. Cf. Vernon (1935b).

In addition, oculists, speech therapists, doctors, and
other specialists use a number of tests which would fall
within this group.


VI. TEMPERAMENT, PERSONALITY, AND
CHARACTER TESTS


A. Ratings or assessments of personality traits by
acquaintances. Cf. Vernon (19380), Symonds (1931),
Hunt and Smith (1935).

B. Performance tests, e.g. Downey Will-temperament,
tests of perseveration and fluency of association, tests of
character. Cf. Vernon (1935a), Symonds (1931), Cattell
(1936).

C. Questionnaires, attitude scales, inventories, and
interest blanks. Cf. Vernon (1938b).

D. Qualitative approaches: observations of manner
during testing or interview; free word association;
Rorschach inkblots; Mosaics test; drawings. Cf. Earl
(1939), Burt (1935), Vernon (1935c, 1938b).


CHAPTER X 
HINTS TO TESTERS 


1. Choice of a Test.—In choosing a test for some particular
purpose, the tester’s first consideration should, of course,
be its validity for that purpose. As we have seen in
Chap. IX, he may be able to judge its validity to some
extent by inspection of its content, but he would be better
advised to seek out experimental proof of what it is that
the test measures. Investigations by its author, or by
others who have used the test, should be consulted. The
mere fact that it is called a test of such-and-such an ability
does not constitute evidence. The second consideration
(which greatly affects the first) is the test’s reliability.
Generally speaking, a test must be fairly long, and its
scoring must be objective, if it is to be satisfactory in this
respect. Most authors of good tests publish evidence of
their reliability in the accompanying manuals. Little
trust should be placed in tests whose scores are not known
to yield a reliability coefficient of + 0·90 or over.

2. Age Range.—Few, if any, tests presume to cover all
ranges of ability, and most educational and intelligence
tests are only intended to cover quite a limited range.
Thus the tester should be careful to choose from among
the available tests those which will, when applied to a large
group of his testees, be likely to yield a normal distribution
of scores, and which will not unduly curtail this distribution
at either end. For instance, in testing an ordinary
school class, he should first estimate roughly the
range of M.A.s or E.A.s which it contains. A class of …
majority should lie between 73 and 121 years. He should
then study the ages for which the various tests provide
norms, and pick one which covers as much of this predicted
range as possible. If he has reason to believe that
the average M.A. or E.A. of the class will be, perhaps, a
year above or below its average C.A., he must correspondingly
adjust his choice of test.

The ‘differentiating power’ of the test norms should
also be noted. They should show a fairly large and even
increase in score from one age level to the next. Suppose
a test containing, say, 200 items, to provide the following
norms from 6 to 13 years. The increases are large up to
9 or 10, but then tail off, till there is only a difference of 5
items between the 12 and 13 year scores. Such a test will
clearly not discriminate or differentiate well at the upper
end, and its effective range of norms is from 6 to 11 rather
than from 6 to 13.


Years:     6     7     8     9    10    11    12    13
Score:    40    71   101   128   151   168   179   184


Similar considerations apply to the choice of tests whose 
norms are issued in percentile, or other, forms. 
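
The check on differentiating power described above amounts to inspecting the gain in score from each age level to the next. A minimal sketch follows, using the eight norm scores tabulated above for ages 6 to 13. The cut-off of 15 items per year is purely an assumption made for illustration; the text gives no numerical criterion for where the increase becomes too small.

```python
AGES = [6, 7, 8, 9, 10, 11, 12, 13]
SCORES = [40, 71, 101, 128, 151, 168, 179, 184]  # norms from the example


def yearly_gains(ages, scores):
    """Return (from_age, to_age, gain) for each pair of adjacent age norms."""
    return [(ages[i], ages[i + 1], scores[i + 1] - scores[i])
            for i in range(len(ages) - 1)]


def effective_upper_age(ages, scores, min_gain):
    """Highest age reached before the yearly gain first falls below min_gain."""
    upper = ages[0]
    for _, to_age, gain in yearly_gains(ages, scores):
        if gain < min_gain:
            break
        upper = to_age
    return upper


if __name__ == "__main__":
    for from_age, to_age, gain in yearly_gains(AGES, SCORES):
        print(f"{from_age} to {to_age}: +{gain} items")
    # min_gain=15 is a hypothetical threshold, not one given in the text.
    print("effective upper age:", effective_upper_age(AGES, SCORES, min_gain=15))
```

With this assumed cut-off the gains fall short from the 11 to 12 year step onward, so that the effective range comes out as 6 to 11, in agreement with the judgment reached by inspection.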

3. Convenience of Arrangement and Scoring.—If a test is
to be used extensively, attention should be paid to its
general arrangement from the point of view of ease in
applying and scoring. With some performance tests and
tests of special aptitudes, much time is wasted in setting
out the material, or the instructions are very elaborate, or
the scoring unnecessarily detailed. More modern American
tests usually try to reduce these inconveniences.

Group tests of attainment or intelligence may also vary
greatly in ‘mechanics’ or ‘lay-out.’ Some of the older
ones have insufficient instructions and sample items or
practice sheets, so that testees often fail to understand
what they have to do, and a lot of additional oral explanation
is needed. Sometimes the testees have to answer in
their own words, and the scoring manual never allows for
all the possible answers which turn up, so that much time
is wasted in deciding whether or not an answer shall be
counted. The positions of the answers are often scattered
all over the page, instead of being arranged in the right-hand
margin. True, stencils are often provided which, when
laid over the test page, show up the correct answers. But
according to the present writer’s experience, it is far more
trouble to fit the stencils over each successive page than
it is to compare the testee’s booklet with a booklet in
which all the correct answers are clearly marked.

Many American group tests are now issued with an interleaved
carbon-paper device, so arranged that all the correct
answers, and none of the incorrect ones, are
shown as crosses on the back of the page. The scorer then
merely has to count the number of crosses. Alternatively,
testees may check their answers with a special
pencil; their blanks are then passed through an electrical
machine which counts instantaneously the number of
carbon marks in correct positions, through the amount of
current which passes. In this country, however, testing is
seldom carried out on so large a scale as to warrant such
ingenuity, and its consequent expense.

4. Procedure.—Many of the following hints may seem
puerile to some readers, yet they are all based on the
writer’s experience of mistakes actually committed by
amateur testers. Such mistakes may often lead to quite
serious consequences.

The tester should first study the instructions and the
test material carefully, and should follow the instructions to
the letter. Often it is worth the tester’s while to take the
test himself. In Binet testing, results should not be
trusted until such time as the tester practically knows the
instructions, and the essentials of scoring, by heart, and
until he has given the test to, say, t…

5. Modification of Instructions.—The inexperienced
tester is very apt to find flaws in the test or
instructions which, he thinks, could be eliminated by some
simple modifications. Very likely the test could be improved,
but then it would no longer be the same test, and
the published norms would no longer hold good. For the
same reason a printed group test must not be duplicated
or mimeographed in order to save expense. Quite apart
from the probable infringement of copyright, the difficulty
of the test is likely to be altered thereby.

6. Timing.—Accurate timing is essential. In some
group tests a mistake of 5 seconds may quite possibly
lead to an error of 1% or 2% in average I.Q. Testers who
do not own stop-watches are advised to use omnibus tests 
when feasible, since these require only a single timing. If 
times are kept by a watch with a seconds hand, the exact 
minute and second at which a test is to stop should be 
written down as soon as the test has started. Mistakes 
very easily occur if the tester trusts to his memory. 
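
The advice to write down the stopping time at once can be illustrated by a small calculation. The sketch below is merely one way of working out the minute and second at which a test must end; the starting time and the 20-minute limit are invented for the example, and the function name is hypothetical.

```python
from datetime import datetime, timedelta


def stop_time(start: datetime, minutes: int, seconds: int = 0) -> datetime:
    """Clock time at which a test begun at `start` must be stopped."""
    return start + timedelta(minutes=minutes, seconds=seconds)


if __name__ == "__main__":
    # Hypothetical case: the test starts at 10:17:42 and lasts 20 minutes.
    started = datetime(1940, 1, 1, 10, 17, 42)
    print("stop at", stop_time(started, minutes=20).strftime("%H:%M:%S"))
```

Recording the result of such a calculation on paper as soon as the test has begun removes any need to trust to memory while supervising.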

7. Underlining and Erasing.—Before giving a group test 
to a class, all rulers and indiarubbers should be firmly 
removed, otherwise the pupils will waste several minutes in 
ruling their underlinings, or rubbing out responses when
they change their minds. If the test instructions do not
already say so, the tester should explain to the pupils that,
in order to change a response, they should scribble out the
first choice rapidly, and then mark or underline the new
choice clearly.

8. Distractions and Copying.—In all testing, possible
sources of distraction should be avoided. For instance, a
group test should not be given when a singing lesson is due
to occur in the next classroom. The testees should be
spaced as far apart as possible, since it is very easy, in
multiple-choice tests, for one to overlook and copy the 
answers of another. Often the younger children have no 
intention of cheating, but are yet extremely anxious to 
write down the same as their neighbours, simply because 
the whole situation is unfamiliar to them, and they are 
afraid of doing the ‘ wrong thing.’ The motive behind 
such copying is a desire for social conformity. 

9. Practice Sheets and Samples.—In most group tests for
children there are preliminary practice sheets or sample
items, where help may be given. The tester should make
sure that even the dullest testees understand what is required,
and that they can manage at least one or two
sample items by themselves.

10. Coaching.—No practice or coaching of any kind,
beyond what is specified, should be given. Testees who
gain an advantage by such means will no longer be
comparable with those on whom the test was standardized,
and the norms will be useless.

11. The ‘Subjective’ Situation.—At the same time as
adhering to the standard objective conditions of testing,
the tester should try to obtain, even in a group test, a
favourable subjective atmosphere. If the tester himself is
apprehensive, or bored, his mood is likely to communicate
itself to a group of children. The testees should be neither
excited and anxious, nor, of course, lackadaisical, in their
attitude to the test. Teachers should not peer over
children’s shoulders while they are working to see how
they are getting on; it inevitably upsets them.

12. Testing and Teaching.—Many teachers make bad
testers because the teaching attitude is so ingrained in
them that they cannot refrain from drawing morals and
pointing out mistakes. A test is not a kind of lesson;
children should not be expected to learn anything whatever
from it. Rather it is a matter of stock-taking. The
essential object is to obtain a record of the testees’ performance …

… is just as good or better. This may be so, but once personal
opinion enters the scoring … the results again cannot be …
The scoring of most group … errors very readily creep in,
both in checking the right responses and in adding up correct
checks. It is extremely desirable for such tests to be scored
twice, preferably by different persons.

14. The Subjective Situation in Individual Testing.—The
conditions of individual testing require still more tact and 
sympathy than those of group testing. By means of pre- 
liminary conversation the child should be put at ease, and 
interspersed irrelevancies should keep him interested and 
responsive throughout. If he shows signs of fatigue, dis- 
traction, or resistance the testing should be broken off. 
The good tester develops a kind of ‘ bedside manner,’ and 
avoids formalities. He does not read the oral tests from 
the handbook, but, knowing them by heart, says them in 
а conversational tone. In order to keep the child interested 
he must himself sound interesting. Encouragement should 
follow every response, but some caution is needed in praising
incorrect responses, since many children may realize
when they are wrong and become careless or discouraged
if there is no criticism nor stimulus to do better. The
presence of a third person, particularly the mother or head
teacher, constitutes a distraction which should always be 
avoided. 

In giving the older versions of the Binet scale, it was the
common practice among trained testers to jump about from
one part of the scale to another. This enabled the tester
to choose the test which seemed to him to be most appropriate
to the child’s mood at the moment, also to save time
by grouping together tests with identical instructions
(e.g. the digit memory tests); and it prevented all the most
difficult tests from coming at the end of the session.
Unfortunately Terman and Merrill (1937) have now forbidden
this procedure. However, their book contains several other
useful hints on obtaining good rapport, while at the same
time adhering strictly to the standard instructions and
scoring.

15. Record Keeping.—The following principle is commonly
adopted by experienced testers: never spend any of
the child’s time in making records of his responses. Notes
should certainly be made, not only of test responses, but
also of manner, speech, etc. All the recording, however,
should be done while the child is engaged either in doing a
test, in irrelevant conversation, or in listening to the
instructions for another test; and it should be so unobtrusive
that it never constitutes a distraction. Poor testers
not only waste time in recording responses and looking up
the scoring, but also allow awkward gaps to occur which
entirely spoil the favourable atmosphere of alertness and
interest. The reason for this is usually mere lack of
familiarity with instructions and scoring.

16. Impartiality.—While the good tester enters very
fully into the thoughts and feelings of the child he is testing,
yet he also needs to exert considerable self-control.
There are various ways in which he may (quite unwittingly)
give illegitimate hints, for instance, by over-emphasizing
the crucial words in an oral test, or by making emphatic
movements in the direction of the correct blocks,
etc., of a performance test. Further, he cannot help but
feel more sympathetic with some children than with others.
He should, therefore, frequently ask himself: “Am I in
any way being prejudiced by my liking for this child; am
I giving hints or crediting him with doubtful passes
because I believe that he is bright, which I would not do
if I believed that he was dull?”

17. Studying the Whole Child.—The tester should never
regard his job merely as measurement of the I.Q. or other
test scores (although many doctors, teachers, and others
… consideration. Although, as we have seen, he is not justified
in analysing a heterogeneous test like the Binet scale into
tests of hypothetical discrete abilities (memory, verbal,
practical, etc.), yet he can pick up innumerable hints
regarding the child’s intellectual and emotional characteristics,
which can later be explored more thoroughly. Suggestions
of specific educational difficulties may readily be
followed up. Bühler (1938) shows that responses to the
Ball and Field (or Purse and Field) test may give valuable
diagnostic indications. Other Binet items and, still more,
performance tests like the Porteus Mazes, may be equally
revealing (cf. Vernon, 1937a; Earl, 1939). These deductions
should, of course, be checked later by comparing
them with other sources of information, such as the
psychiatrist’s case-record, when available. In this way a
tester can always be extending his methods and improving
their validity.

18. Interpretation of Test Results: Unreliability.—No
mental test score should ever be accepted at its face value,
nor trusted in the same way as physical measurements are
trusted. Even the best tests, it should be remembered,
only measure to within a certain Probable Error. In
other words, a test score should be regarded as the centre
of a range of possible scores. It may be correct to within,
say, 5%, but also it may be in error by as much as 20%,
unless it has been confirmed by the results of other tests
which narrow the range of uncertainty.
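
The notion of a score as the centre of a range can be made concrete with the usual textbook expression for the probable error of a single score, P.E. = 0·6745 × SD × √(1 − r), where r is the reliability coefficient. The sketch below assumes this standard formula, which is not quoted in the passage above, and the standard deviation of 15 points is likewise a hypothetical figure chosen for illustration.

```python
import math

PE_FACTOR = 0.6745  # probable error as a multiple of the standard error


def probable_error(sd: float, reliability: float) -> float:
    """Probable error of an obtained score, from the score SD and the
    reliability coefficient of the test."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return PE_FACTOR * sd * math.sqrt(1.0 - reliability)


def score_band(score: float, sd: float, reliability: float,
               multiples: float = 1.0) -> tuple:
    """Range of `multiples` probable errors on either side of `score`."""
    margin = multiples * probable_error(sd, reliability)
    return (score - margin, score + margin)


if __name__ == "__main__":
    ASSUMED_SD = 15.0  # hypothetical I.Q. standard deviation
    low, high = score_band(100.0, ASSUMED_SD, reliability=0.90)
    print(round(low, 1), round(high, 1))
```

By the definition of the probable error, about half of all obtained scores lie within one such margin of the true value, so that wider bands (for instance three or four probable errors) are needed before an error can be called unlikely.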

19. Units of Measurement.—The imperfections of mental
units should also be borne in mind, for example, the
lack of equivalence of M.A. and E.A. units (cf. p. 84).
The I.Q. is still more open to criticism. It is misinterpreted
by a very large proportion of testers, and by almost
all teachers and laymen. It is not an index (even an
inaccurate index) of the amount of intelligence that a child
actually possesses. The M.A. is such an index. The I.Q.,
being a ratio, is a measure of rate, namely, rate of mental
development. The point may be brought out by the
following example of the test scores of two children,
A and B.

A:   C.A. 9:0   M.A. 11:0   I.Q. 122
B:   C.A. 8:0   M.A. 10:6   I.Q. 131

Here it is A, not B, who is the more intelligent, since he
can do more difficult intellectual tasks, and can be
expected to be superior in school work. It is true, however,
that B is growing more rapidly in mental powers than A,
that in about another four years he will catch up A,
and that he will probably be, as an adult, the more
intelligent of the two.
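
The two I.Q.s in the example can be checked with a short calculation. The following is a minimal sketch, assuming the usual ratio definition (I.Q. = 100 × M.A. / C.A.) and ages written as years:months; the function names are illustrative only.

```python
def months(years, extra_months=0):
    """Convert an age written as years:months into months."""
    return years * 12 + extra_months


def iq(mental_age_months, chronological_age_months):
    """Ratio I.Q.: 100 * M.A. / C.A., rounded to the nearest whole number."""
    return round(100 * mental_age_months / chronological_age_months)


# Child A: C.A. 9:0, M.A. 11:0.  Child B: C.A. 8:0, M.A. 10:6.
iq_a = iq(months(11), months(9))
iq_b = iq(months(10, 6), months(8))

print("A:", iq_a, " B:", iq_b)
```

B has the higher ratio although A has the higher M.A., which is the distinction between rate and level drawn in the text.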

In actual practical work of educational guidance and
emotional treatment, the tester need hardly bother to
calculate the I.Q. or E.Q.s, since what matters is the present
level of the child's intelligence and educational capacities.
The only proper use of the I.Q. is for predicting a child's
future career, i.e. for assessing approximately his ultimate
educational or vocational level.

20. Dependence of I.Q. on the Test Employed.—The naive
tester or layman usually assumes that a child would obtain
the same I.Q. whatever the test employed. This is so far
from true that the test employed should invariably be
quoted along with an M.A., E.A., I.Q., or E.Q. The
reasons for these discrepancies have already been given,
and will merely be summarized here.
(a) Chance errors of measurement, i.e. the fact that
no test is perfectly reliable (cf. pp. 145-9).
(b) Many tests are not adequately standardized. For instance,
American group tests may give results which are too
lenient (cf. pp. 80-1).
(c) Different intelligence tests have different dispersions
of M.A.s and I.Q.s, so that one test makes a bright child
out to be much brighter, a dull child to be much duller,
than does another test (cf. pp. 85-6).
(d) All test scores may be liable to distortion by
various unknown group factors, or by emotional or other
influences, about which we possess little exact knowledge,
and which we are not able to control (cf. pp. 190-5).
In view of this uncertainty over the norms and the
dispersion of group intelligence tests, the inexperienced tester
would be well advised never to calculate individual I.Q.s
from them, but to treat their results in purely relative
fashion, i.e. in the way which was recommended for
educational tests (cf. p. 82). The class teacher can obtain from
a group test practically all the information which it is
capable of yielding if he or she simply finds which child in
the class has the highest score (corresponding to the highest
M.A.), which the next highest, and so on. When test
results are thus confined to rank orders, they are far less
likely to be misinterpreted than is common at present.
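
Reducing group-test results to a rank order, as recommended above, can be sketched as follows. The pupils' names and raw scores are invented purely for illustration, and the function name is hypothetical.

```python
def rank_order(scores):
    """Return pupils' names from highest raw score to lowest.

    `scores` maps a pupil's name to a raw group-test score.
    Pupils with equal scores keep the order in which they were entered.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ordered]


# Hypothetical class results (raw scores, not M.A.s or I.Q.s).
class_scores = {"Pupil P": 41, "Pupil Q": 57, "Pupil R": 33, "Pupil S": 49}
ranking = rank_order(class_scores)

print(ranking)
```

Only the order of the pupils is retained; no I.Q. is calculated, so the doubtful norms and dispersion of the test do not enter into the result.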

21. Conclusion.—Mental tests have been widely, and
often unfairly, criticized. But the progress of testing is
probably hindered to a greater extent by its friends, who
are ignorant of its limitations, than by its enemies. One
of the main objects of this book is to analyse these
limitations and to explain the underlying principles which must
be understood if tests are to be applied and interpreted
properly. No one ought to use tests who is not willing to
familiarize himself with these matters, and with the
literature which bears on the particular tests that he employs.
In skilled hands, testing provides a far surer and more
accurate tool for the assessment of abilities than do
subjective impressions or the ordinary examination paper.
But human nature is far too complex to be measured in
the simple and direct ways in which physical quantities
can be measured, and the over-enthusiastic but unskilled
tester is only too likely to make serious mistakes.


CHAPTER XI 


EXAMINATIONS 


Functions of Examinations.—The chief functions of
ordinary written examinations may be summarized as
follows:

1. They are employed as tests of achievement, to prove
that an individual pupil has, or has not, acquired a certain
amount of knowledge of a subject.

2. They are given to a whole class, or school, for a
similar purpose, or for assessing the efficiency of the
teachers whose pupils are examined. Although 'payment
by results' has disappeared, schools are still commonly
judged by the public on the basis of their examination
successes. And both head teachers and inspectors try to


3. Many of the most important examinations are
assumed to possess a prognostic function, i.e. to be capable
of predicting future achievements. Thus they are used to
bar the gates between the primary and the secondary
school, between the secondary school and the university,


4. Many qualities besides achievement, or promise of
achievement, in particular scholastic subjects, are believed
to manifest themselves in examination success. The
business employers who insist that their clerks shall have
passed Matriculation or the Leaving Certificate do not
actually desire employees who are good at Latin, biology,
and the like. When asked why they attach such importance
to academic examinations, their answers show that
they hope, rather, to acquire clerks of good general
all-round ability in non-academic as well as in academic
subjects. They say also that success at an examination
requires a certain amount of perseverance and industry,
some docility to discipline, and calmness under pressure;
and that these qualities of character are desirable also in
their employees (cf. Harding, 1938). The significance
attached to a university degree is often almost as variegated.
Employers are probably quite as much to blame
for the stranglehold of the contemporary examination
system as are the teaching profession and the education
authorities.

5. Another function of examinations whose existence is
seldom explicitly admitted is that they constitute an
incentive for stimulating pupils and students to work,
particularly in the secondary school and university; also for
keeping teachers up to the mark. Indeed, some would
maintain that the prospect of having to do examinations
must still be employed as one of the chief motives in
education, even if their results are entirely worthless. Clearly,
then, examinations are an essential feature of the whole
educational system which relies largely upon extraneous
motives, such as competition, fear of punishment, and
blame. And they can hardly be discarded until the more
progressive ideal of appealing to the pupils' or students'
spontaneous interests becomes a practical reality.

6. Many American writers (e.g. Kandel, 1936) believe
that the only legitimate function of examinations should be
that of guidance. By checking up the results of his
instruction the teacher can see where he has failed to make some
topic clear, and can improve his methods. By examining
a pupil's accomplishments he can both diagnose the pupil's
weak points which require special attention, and can direct
his endeavours along the lines for which he is most suited.
Secondary schools and universities should apply
examinations to entering candidates, not so as to keep out all but
a select few, but to discover the fields of work in which
each candidate is likely to do best, and should dissuade
from entering only those who could profit from no kind of
advanced study. From so democratic a conception of


Yet far removed. 

Unfortunately, the traditional written examination fails
very badly in its attempts to fulfil these many different
functions. We will consider first the criticisms which have
been levelled against it, and later ask whether other
measuring instruments would serve the purpose better, or
whether its adequacy can in any way be improved.

Criticism of the Effects of Examinations upon Education.
—The first and commonest complaint is that examinations
dominate and distort the whole curriculum. Although
scarcely 20% of the primary and preparatory school
population enter grammar schools, and only 2% eventually
reach the university, yet every pupil has to work as if
his ultimate objectives were a School Certificate and a
university degree. The proportions in Scotland which
attain these higher stages of education are more than
double those in England, yet the same harmful effects are
found. No teacher or school can afford to give to the
majority of pupils the knowledge which would be of most
value to them in after-life because, as we have seen, both
are judged by the results of the minority who must sit
examinations concerned almost exclusively with formal,
academic subjects. Even the infant schools tend to
become preparatory departments for training in the three
R's. Homework is set at far too early an age. Brighter


successes as possible may be secured in the special place
examination. Similarly, for the sake of university
scholarships, narrow specialization occurs in the secondary and
public schools.

Furthermore, these examinations stimulate an unhealthy,
competitive spirit in children, and at all ages they
encourage the cramming of set books and rote memorization,
rather than education in its best sense. Doubtless
they act as an incentive (Function No. 5), but not an
incentive to learning for its own sake. Indeed, examinations
are an inefficient type of incentive, since most pupils
and students regard them as too remote to bother about
until a few weeks or days beforehand. A few candidates,
however, take them much too seriously, or are forced by
their parents and teachers to over-work; with the result
that many of the maladjusted children who come to
Psychological Clinics for advice and treatment show signs
of this examination strain.

Apparently the only reason, apart from mere conservatism,
which is put forward to excuse this state of affairs,
is the old fallacy of formal discipline. It is said that
although the passing of examinations in academic school
subjects is of no direct value to 80% or more of the
population, yet the pupils' minds are trained thereby so that they
become better able to learn other subjects, to reason, to
write good English, and so forth. A large body of
experimental evidence not only proves that this spread of training
in one field of study to other fields is usually extremely
limited in extent, but also suggests that the very opposite
may occur. The system may indeed train pupils and
students in the best methods of passing examinations in
other subjects, but it is more likely to stunt reasoning
power owing to its emphasis on mechanical memorization.
Often it produces a dislike of everything associated with
school, and so discourages the desire for further education.
It does not help to develop writing of good English, since
the English style in which most examination answers are
written is known to be, on the average, markedly inferior
to the style in which the same pupils can write compositions
at leisure.

INADEQUATE RELIABILITY OF EXAMINATIONS


We will not, however, dwell on these points since, for the
purposes of this book, it is a more serious matter that most
examinations are very defective methods of measuring the
achievements or the future promise which they set out to
measure. Their chief, though not their only, flaw is
inadequate reliability. The following facts are well
established:

(a) When examinees sit two different examinations
dealing with the same scholastic subject, both set and
marked by the same examiner, they are likely to obtain
distinctly different marks on the two occasions.

(b) When a single set of scripts is marked by two different
examiners, the resulting marks are liable to show wide
discrepancies.

The main causes of poor reliability are:

(1) Actual changes in the examinees' mental and
physical states which affect their examination performance.

(2) Inadequate sampling of the examinees' knowledge
and ability by the particular questions set.

(3) Inconsistencies in the standards of marking adopted
by different examiners, or by the same examiner on
different occasions.

(4) Differences of opinion between different examiners
regarding the relative merit of the examinees' scripts.

1. Changes in the Examinees.—This was described above
(p. 148) as 'function fluctuation.' Two closely parallel
examinations sat on consecutive days will always give
somewhat divergent results, because some examinees will
be feeling more fit, or in a better frame of mind, at the first
of the two, others at the second. Many external and
irrelevant happenings may temporarily depress or improve the
work of a few individuals. Over longer intervals the
examinees' abilities will naturally show considerable
developments and alterations. On the whole, however, the
uncertainties occasioned by such factors are probably
smaller and less important than those due to the other
three causes.

2. Inadequate Sampling.—This factor is responsible for
the common complaint among examinees, namely, that the
questions 'did not suit them,' that on another set of
questions they might do better (or worse). A pupil's
capacities in the various special branches or aspects of a
subject are always somewhat uneven. And no examination
is capable of bringing out everything that the pupil
can do in that subject, or of measuring all his strong and
weak points. Hence, the half-dozen or so questions which
he answers are never more than a sample of the questions
which would be needed to cover the whole field. Common
sense, as well as statistical principles, tell us that the
more lengthy the examination, and the greater the number
of questions, the more representative will the sample be,
and the better will the ground be covered. Thus in
certain university honours examinations where the
students' marks are based on twenty-four hours of
examination work, their marks would be unlikely to alter to any
great extent if they sat another similar series of papers.
But when, as in the special place examination, the pupils
only take one short paper in English and one in arithmetic,
quite large alterations might easily occur in further parallel
papers. The typical correlation coefficient between marks
on ordinary two-hour papers in the same subject is known,
from many experiments, to be no higher than + 0.70,
and often less than this. We can therefore predict, by
applying the Spearman-Brown formula (p. 147), that
examinees would require to take at least four such papers
before the ground was sufficiently covered for the reliability
coefficient to rise to the fairly respectable figure of + 0.90.
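
The prediction of four papers can be reproduced with the Spearman-Brown formula in its standard form, r_n = n r / (1 + (n - 1) r), where r is the reliability of a single paper and n the number of papers. The sketch below is an illustration only; the function names are not from the text.

```python
import math


def spearman_brown(r_single, n_papers):
    """Predicted reliability when n parallel papers are combined."""
    return n_papers * r_single / (1 + (n_papers - 1) * r_single)


def papers_needed(r_single, r_target):
    """Smallest whole number of parallel papers reaching the target reliability.

    Obtained by solving the Spearman-Brown formula for n and rounding up.
    """
    n = r_target * (1 - r_single) / (r_single * (1 - r_target))
    return math.ceil(n)


needed = papers_needed(0.70, 0.90)
with_three = spearman_brown(0.70, 3)
with_four = spearman_brown(0.70, 4)

print(needed, round(with_three, 3), round(with_four, 3))
```

Three papers fall a little short of + 0.90, whereas four just exceed it, which agrees with the statement that at least four papers are required.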

Suppose that one hundred pupils take such an examination,
with a reliability coefficient of + 0.66; and that
their range of marks is from 30 to 90, their average 60, and
their standard deviation 12 marks. The P.E.estim. of
an individual mark will then be approximately ± 6%.
Now if this is an examination in which the passing mark is
fixed at 70%, we can predict that all those who scored
70 to 75 have an even chance of scoring below 70% on a
second similar paper, and that those who scored 64 to 69
have an even chance of gaining 70% or more on a second
paper. This means that only 10 of the pupils would
pass according to both papers, 61 would fail on both, and
as many as 29 would pass on the first and fail on the
second, or vice versa. This hypothetical, but only too
typical, illustration demonstrates the unfairness which
inevitably results from the lack of thoroughness of most
examinations, and in particular the difficulties which arise
when the examination is supposed to divide the candidates
into two distinct categories at an arbitrary pass mark.
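
The figure of approximately ± 6% can be checked with a short calculation. The sketch below assumes the usual formula for the probable error of estimate, 0.6745 × S.D. × √(1 − r²); the formula is not restated in this passage, so it is offered as an assumption that happens to reproduce the quoted value, and the names used are illustrative.

```python
import math

# 0.6745 standard deviations either side of the mean enclose half of a
# normal distribution; this converts a standard error to a probable error.
PROBABLE_ERROR_FACTOR = 0.6745


def probable_error_of_estimate(sd, reliability):
    """P.E. of a mark predicted from a parallel paper (assumed formula)."""
    return PROBABLE_ERROR_FACTOR * sd * math.sqrt(1 - reliability ** 2)


pe = probable_error_of_estimate(sd=12, reliability=0.66)
pass_mark = 70
doubtful_band = (pass_mark - pe, pass_mark + pe)

print(round(pe, 2), [round(x) for x in doubtful_band])
```

On this assumption the doubtful band extends about six marks either side of the pass mark, i.e. roughly from 64 to 76, in line with the ranges of 64 to 69 and 70 to 75 mentioned above.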

An additional reason for the variability of a pupil's work
in answering different questions or sets of questions is that
the majority of examination answers consist of English
compositions or essays. (This does not usually apply to
papers in mathematics or in foreign languages; but it does
to English, history, geography, science, and other subjects.)
And the English essay is known to be a particularly variable
production. The compositions written by a pupil
weekly throughout a school year fluctuate widely in their
merits. As Ballard (1928) points out, teachers may refuse
to accept this statement, and claim that their pupils are
fairly consistent from week to week. But the fact that a
teacher marks any one pupil's work on a dead level probably
indicates that, unwittingly, he has formed a stereotyped
conception regarding his ability, and is marking the
successive compositions not on their own merits but in the
light of his knowledge of the pupil's average performance.
The truth of this suggestion is borne out by the many
stories of the two pupils who regularly received, one a good,
the other a bad, mark, and who continued to do so even
when they exchanged their essays.
An outside examiner does not possess this knowledge of


deviations are likely to be exaggerated. Probably much
more representative products would be forthcoming if the
examination could be answered at leisure, under conditions
such as apply when pupils do written work
at home, or at school.

3. Divergent Scales of Marks.—This factor has already
been discussed fully in Chaps. II and IV, hence we will not
pause to re-argue the illogicality and injustice of a system
which permits a far higher proportion of passes, or a higher
average mark, to be awarded in one university or School
Certificate subject than in another subject. If a mark of
80%, or a first class or credit, signifies to one examiner the
standard achieved by a quarter of the examinees, and to
another the standard reached by only 1% of the examinees,
then it is obvious that this mark means different things
to different people. And it follows that an examinee's
mark will vary with the particular examiner who happens
to read his papers, unless all the marks given by all the
examiners are uniformly standardized in accordance with
the principles outlined above.

The standards of marking even of a single examiner are
liable to alter with fatigue and mood. When marking
large numbers of answers to the same question, he may
begin very strictly and gradually become more lenient, or
vice versa. The same script might receive a different
mark if read after, instead of before, dinner.

4. Differences in Opinion as to Relative Merit.—The most
important source of unreliability (which is usually found
in combination with No. 3) is the personal element which
enters into every examiner's evaluation of a script. We
will first summarize some of the evidence regarding this
subjectivity. The classical experiments were carried out
in America as far back as 1912. Starch and Elliott (1913)
sent copies of a single geometry paper to the chief geometry
teachers of 116 high schools, with the request that they
should mark it in accordance with their usual practice.
The marks awarded varied all the way from 28% to 92%.
Two of the teachers gave it more than 90%, 18 gave it
between 80% and 90%, 18 gave it between 60% and 30%,
and 2 under 30%. Similar results were obtained with
papers in English and in history. Equally striking is the
anecdote related by Wood (1921) of the papers which were
marked in the usual way by six examiners. The first
examiner, for his own guidance, wrote out a set of (what he
regarded as) model answers to the questions; but he
unfortunately left this among the candidates' papers. He
was subsequently awarded marks varying from 40% to 90%
by the other five examiners! Many experiments,
conducted under far more strictly controlled conditions than
these, have substantially confirmed their findings.
European educationists, however, were so little impressed by
them, that it was necessary for further elaborate studies to
be carried out recently in England and in France, under
the auspices of the International Examinations Enquiry
Committee.

Hartog and Rhodes (1935) arranged for typical groups
of scripts to be marked by experienced examiners under
normal examination conditions. Special place, School
Certificate, and university honours examinations were
represented, and several different subjects were studied.
Generalizing from a number of their results, it appears
that when half a dozen examiners independently mark a
set of scripts, the lowest and highest marks given to any
one script differ on the average by about 10 to 12%. Among
a few of the scripts the discrepancies may be quite small,
but among others they may be far larger, often greater
than 20%. The authors show that part of the divergence
may be ascribed to the different standards of leniency
adopted by different examiners (No. 3, above), but that
even when this factor is eliminated, large variations remain.
In several of the experiments the examiners drew up an
agreed scheme of instructions for marking, listing both the
general characteristics and the detailed points which were
to be looked for in the scripts. In the marking of English
compositions, where this was scarcely possible, the
inconsistencies were much greater.

Now Hartog and Rhodes doubtless desired to shock the
public conscience with regard to examinations, and so did
their best to emphasize the disagreements. The present
writer is more impressed by the smallness than by the
largeness of the discrepancies, and would conclude from his
study of the results that important public examinations,
like the School Certificate and university honours, are
marked much more consistently than is usually the case.
Instead of laying their chief emphasis on the extreme
discrepancies, the authors might have pointed out that the
median disagreement between any two examiners is not
more than 3% in the best-conducted examinations. Since
this figure represents the Probable Error of an examinee's
mark, a few examinees will naturally show discrepancies
four or five times as large. It represents an average
correlation between different examiners of + 0.80, when
the standard deviation of their marks is 10%. And
decidedly lower figures have been obtained by other
investigators. The authors also omit to point out that the
reliability of an examinee's final total mark is generally
very much higher than their results indicate, since it may
be based on half a dozen or more papers, each marked by
two or more examiners. Their results mostly refer to
single three-hour, or two two-hour, papers.
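
The correspondence between a Probable Error of 3%, a standard deviation of 10%, and a correlation of + 0.80 can be checked numerically. The sketch below assumes that the probable error of measurement is 0.6745 × S.D. × √(1 − r); this passage does not state the formula, so it is an assumption, adopted here because it reproduces the quoted figures. The function names are illustrative.

```python
import math

# Converts a standard error to a probable error under a normal distribution.
PROBABLE_ERROR_FACTOR = 0.6745


def probable_error_of_measurement(sd, correlation):
    """P.E. of a single mark, given the inter-examiner correlation (assumed formula)."""
    return PROBABLE_ERROR_FACTOR * sd * math.sqrt(1 - correlation)


def implied_correlation(sd, probable_error):
    """Invert the same formula: the correlation implied by a given P.E."""
    return 1 - (probable_error / (PROBABLE_ERROR_FACTOR * sd)) ** 2


pe = probable_error_of_measurement(sd=10, correlation=0.80)
r = implied_correlation(sd=10, probable_error=3)

print(round(pe, 2), round(r, 2))
```

On this assumption a correlation of + 0.80 with a standard deviation of 10% gives a probable error of almost exactly 3%, and conversely.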

Nevertheless, this study deserves to be taken very
seriously, since it indicates that the special place, and
other less thorough, examinations are deplorably unreliable;
that in the absence of a scheme of instructions drawn
up and applied by experienced examiners, much worse
discrepancies may arise; that when the average and the
dispersion of marks are not standardized, gross differences
may appear in the proportions of Credits, Passes, Fails,
etc., which are awarded; and that even a P.E. of 3%
may frequently make all the difference between a Pass
and a Fail, or a First and a Second Class.

Another example will be quoted which better illustrates
the subjectivity of marking under ordinary school
conditions. Boyd (1924) collected large numbers of essays on
'A Day at the Seaside' from 11-year-old Scottish
schoolchildren (i.e. children at the qualifying stage). From these
he selected 26 as representing the whole range of merit, and
had them marked independently by 271 teachers. The
marks were not numerical, but consisted of seven grades,
ranging from Excellent to Unsatisfactory. On comparing
the results of different markers, it was found that almost


Excellent; 31 VG+, 80 VG, 125 G+, 28 G, and 1
marked it Moderate (the lowest grade but one). Boyd


and the 20% most severe were excluded, the average essay
was still awarded at least 4 out of the 7 grades. He also
points out that the average grade given to an essay by a
large number of markers is highly consistent, and that the
various divergent individual grades are distributed around
this average more or less in accordance with the normal
distribution curve.

Other experiments, including one of Hartog's, indicate
that when the same examiner re-marks a set of scripts after
an interval, the discrepancies between his first and second
opinions are likely to be almost as great as those between
two independent examiners. Some of the studies by the
French members of the International Examinations


Scientific training. When the markings were compared, 
the average correlation coefficient between the scientists 
was + 0-683, whilst the correlations between the student 
and the scientists averaged + 0.51. The latter figure is
lower than the former, but so little lower that it casts some
suspicion on the value of an examiner's knowledge of his


Particularly striking was a study of the philosophy
examination in the Baccalauréat. One hundred scripts
were marked by six examiners. Only 10 of the candidates
were passed by all the examiners, only 9 failed by all of
them. The remaining 71 were passed by some, failed by
others.¹ It is, of course, always the examinees of
medium ability whose true standing is most uncertain.
The few best or few worst are recognized as such by all
examiners. But only too frequently an arbitrary dividing
line—the pass mark—has to be erected in the middle of
this region of maximum uncertainty. Undoubtedly,
therefore, a large number of candidates who just succeed,
or just fail, to gain a School Certificate or a special place
might very easily obtain the opposite result if they were
marked by a different examiner, or if they sat an alternative
set of papers. In many instances their whole educational
or vocational career may be blighted owing to the
unreliability of their marks.

Reasons for the Subjectivity of Marks.—The prime reason
for the discrepancies which we have described is that
different markers conceive differently the desirable or
undesirable characteristics of examination answers. In other
words, they attempt to assess many diverse types of
abilities. One examiner may give chief credit for evidence
of work done, and for the comprehensiveness and accuracy
of the facts contained in the answers. Another may look
rather for signs of future promise, originality, and grasp of
general principles. One may insist on clarity of expression
of ideas, another may try to assess the profoundness of the
ideas irrespective of their good or bad expression. Some
may search for emotional rather than strictly intellectual
qualities, such as the examinee's interest in the subject.
Whenever a candidate states his attitudes or opinions,
these are liable to coincide, or clash with, the attitudes or
opinions of the examiner; and however desirous the
examiner may be of maintaining impartiality, he is liable
to bias if his pet theories are approved or attacked by the
candidates. Often, if he would take the trouble to formulate
explicitly his notion of the aim of the examination, he
would find that he means by a good examinee the one who
closely approximates to himself, and as a poor candidate
the one who shows none of the intellectual or emotional
qualities which he himself idealizes. Is it to be wondered,
then, that different examiners differ in their opinions of a
candidate?

¹ This result suggests that the correlation between any pair of
examiners averaged + 0.60.

We must recognize that any one script may manifest an
extreme complexity of features, good and bad, and so can
be looked at from a multitude of different points of view.
Its merit can never be assessed in the same way that


expression, the mechanics of its grammar or computation,
and so on. He attaches a certain weight to each of the
possible components (a weight which is likely to be highly
individualistic, unless a detailed scheme of marking has
been adopted), and finally re-synthesizes the candidate's
performance on each component into a single mark for the
answer as a whole. Some examiners do not attempt a
systematic analysis of this kind, and merely mark on the
basis of a general impression. But as Ballard (1928) and
Hamilton (1929) show, every general impression involves
more or less vague analysis.

Even in such subjects as arithmetic, or the translation
of foreign languages, the weightings assigned to the various
analysable features of an answer, and the penalties given to
various possible mistakes, are liable to differ among
different markers. But the complexity is far greater in the
case of English composition, or in those subjects such as
history, science, etc., where the answers consist chiefly of
compositions. The essay-type of answer has been very


evidence of the examinee's good or poor ability.
Investigation shows that the average examinee puts into writing
less than one fact or idea per minute of examination time,
since the process of expressing these facts or ideas takes so
long. The examiner, then, has the still more difficult task
of translating the product back again, of trying to
penetrate through the verbiage to the signs of ability and
knowledge beneath. Few examiners can remain entirely
uninfluenced by the literary style or the handwriting of an
answer (there is direct experimental proof of this), although
they are presumably trying to mark the product for its
historical or scientific merit, not qua essay.


INADEQUATE VALIDITY OF EXAMINATIONS

The poor reliability of examinations greatly reduces
their value for any and every purpose. For if they cannot
even measure examination success without serious errors,
they naturally cannot validly predict anything else. Even,
however, if perfect reliability were established, validity
would still be diminished by the dependence of examination
success on factors other than ability, knowledge, and
industry.

One such factor is, obviously, the efficiency of the
examinee's teachers as examination coaches, and their
cleverness in guessing the kinds of questions which
examiners will set. It is manifestly unfair that a pupil at
one school, where the teachers are specially adept at
instilling examination knowledge, should succeed when a
pupil of equal ability fails because he attends another
school, where the teachers are either less skilled, or more
humane in the education which they provide.

There is also the old complaint that some pupils are
'good examinees,' while others become unduly 'nervous,'
and 'fail to do themselves justice.' One wonders whether
it is not usually the weaker candidates who make this
complaint, and whether they would do better if subjected
to any other kind of test, including the tests of real life.
However, we may admit that certain temperamental
qualities, which we can as yet hardly specify, together
with good physical health, may confer some advantage on
certain candidates, which other candidates lack.

Examinations, then, certainly fail to fulfil the first two 
functions with which we started, namely, the measurement 
of achievement, both because of poor reliability and 
because of the other considerations just mentioned. How 
imperfect they are we cannot determine. We can, how- 
ever, study their validity as measures of future ability by 
directly correlating their marks with the results of the same 
pupils at later stages in their careers. Comprehensive
investigations into this matter have been carried out by
Valentine and Emmett (1932), with almost uniformly
disheartening results. By comparing the subsequent secondary
school achievement of groups of pupils who had been
awarded, or who had failed to obtain, scholarships at the
time of the special place examination, it was found that
many of the latter ultimately surpassed the former group.
Similarly, at the university level, Valentine found that
roughly two-fifths of all those awarded scholarships failed
to justify their promise, since they eventually obtained
only third-class honours or pass degrees. They were
beaten by as many as one-third of the non-scholars, who
achieved first- or second-class honours. Of those who won
scholarships given by the universities, only about one-fifth
failed to live up to expectations, of State scholars only
one-tenth. But this meant that practically one-half of
those who obtained other types of scholarships (from their
schools or from local authorities) did not justify themselves
(cf. Valentine, 1938). 

In a further study by the Scottish Council for Research
in Education (1936), marks of secondary school
pupils at the Leaving Certificate examination, together
with teachers' marks and the head masters' estimates of
the pupils' general standing, were correlated with
subsequent . . .
than those who got second class, and that honours students
had been rated higher than ordinary degree students; but
that, as in Valentine's research, there were a great many
exceptions. The actual correlation coefficients between
school and university marks varied from about +0.30 to
+0.70, and the average of several such results was +0.48.
In the light of our discussion of correlations (Chap. VII),
this figure implies very poor predictive validity.
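
One conventional way of reading a coefficient of this size may be sketched as follows. It is offered only as an illustration, and is not necessarily the calculation used in Chap. VII: it takes the proportion of variance shared by the two sets of marks (the square of r) and the so-called index of forecasting efficiency, 1 − √(1 − r²). The function names are hypothetical.

```python
import math


def shared_variance(r: float) -> float:
    """Proportion of variance in one set of marks accounted for by the other."""
    return r * r


def forecasting_efficiency(r: float) -> float:
    """Fractional reduction in the error of prediction: 1 - sqrt(1 - r^2)."""
    return 1.0 - math.sqrt(1.0 - r * r)


r = 0.48  # average school-university correlation quoted in the text
print(f"shared variance:        {shared_variance(r):.2f}")
print(f"forecasting efficiency: {forecasting_efficiency(r):.2f}")
```

On these conventions a coefficient of +0.48 accounts for rather less than a quarter of the variance, and reduces the errors of prediction by only about one-eighth.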

It should be remembered, however, that the above
investigations (together with most of the studies of reliability)
dealt with highly selected groups. A random sample of
the population might be expected to show a range of
ability at least twice as great as that of university students.
This means that, if it were possible for a random sample to
receive a secondary and university education, and to take
both the Leaving Certificate, or entrance scholarship,
examinations and the degree examinations, the coefficient
of +0.48 might rise to +0.75 or more (cf. p. 140).
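
The rise may be illustrated by the usual correction for restriction of range. This is a minimal sketch, assuming that the figure quoted rests on a formula of this kind (the source does not say so here); `correct_for_range` and `spread_ratio` are hypothetical names, the latter being the ratio of the spread of ability in the unselected group to that in the selected group.

```python
import math


def correct_for_range(r: float, spread_ratio: float) -> float:
    """Estimate the correlation in an unselected group from the correlation r
    observed in a selected group whose spread of ability is 1/spread_ratio
    of that of the unselected group."""
    k = spread_ratio
    return (r * k) / math.sqrt(1.0 - r * r + (r * k) ** 2)


# A random sample assumed to have twice the spread of university students.
print(round(correct_for_range(0.48, 2.0), 2))
```

With the spread exactly doubled the estimate is a little under +0.75, and it rises further if the spread is more than doubled.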

It would be unreasonable to expect perfect correlations
between marks at successive stages in pupils' careers, since
the subjects studied at a secondary school differ distinctly
from those studied at a primary or preparatory school, even
when called by the same names. And the conditions of
work at a university differ from those at a school, being
more congenial to some, less stimulating to others. Again,
every individual changes to some extent as he grows older,
some of his abilities becoming relatively weaker, new ones
developing. All in all, then, the inability of entrance and
scholarship examinations to choose those candidates who
will profit most from higher education is not very surprising
to the psychologist. Yet their validity for this purpose is
certainly much poorer than is generally supposed by the
educational authorities who set the examinations, and who
base their awards and selections upon them. The universities
would scarcely continue to dictate the present constitution
of the School Certificate, Leaving Certificate, and
Matriculation examinations, if they did not believe that
these instruments picked out the most suitable entrants.


Still more unjustified is the belief embodied in the fourth
of the functions outlined above (p. 165), namely that these
examinations are adequate criteria of vocational fitness.
It is indeed true, as employers assert, that successful
candidates will on the average be better in general all-round
ability than those who have failed to gain the Certificates,
since all scholastic abilities are to some extent positively
correlated with intelligence. It is probably true, also, that
vocationally desirable temperamental and character
qualities may aid in examination success. But so do a great
many other factors. Some pupils with extremely undesirable
characters may do well at examinations through
intellectual brilliance; others through efficient coaching
on the part of their teachers, or through over-cramming . . .

. . . temperamental traits; and would give each of these
qualities its due weight in deciding on the suitability of
prospective employees. Undoubtedly the adoption of a
similar scientific study of each individual candidate for
secondary or university education would lead to many
more just selections, and more accurate predictions of
future achievement, than do the present methods.


Although this chapter has consisted mainly of destruc- 
tive criticism, certain hints h 


CHAPTER XII


RELATIVE ADVANTAGES OF NEW-TYPE AND 
ESSAY-TYPE EXAMINATIONS 


It has been widely claimed that the worst defects of the
traditional type of written examination may be overcome
by adopting in its place the objective or new-type test for
measuring achievement. Such tests came into common
use in the United States many years ago, but are still little
known in this country, in spite of Ballard's (1923) able
advocacy. Americans, however, realize now that they also
are liable to certain defects when used undiscriminatingly,
and that there are still functions which can better be
performed by the old- or essay-type of examination. Thus
we are in a position to profit both from their researches
and from their errors, to devise our tests in accordance
with the principles they have already worked out, and to
apply them to the fields of educational measurement where
they will be most useful and most valid.

Now it is clear that some educational products can be
marked much more objectively and reliably than others.
In the early stages of school arithmetic, when children are
given a series of short sums to do, there is no subjective
element involved in deciding which ones, and how many,
they do correctly. But before long arithmetic becomes
more complex; each sum involves a number of different
operations, partial credits are essential, and different
markers are now liable to differ. An English composition,
or an examination answer written in essay form, brings in a
still greater personal element, since it is still more complex
and cannot ever be considered merely from the point of
view of right or wrong. The new-type tester argues,
therefore, let us refrain from setting sums or essay questions
which involve many operations, and instead ask questions
which evoke each operation separately, questions to which
only one right answer is possible. And instead of leaving
to the marker's personal taste the analysis which must be
made of a complex product, and the weighting which must
be attached to its component elements, let the person who
sets the questions make the analysis beforehand, and
decide systematically just how many marks (i.e. how many
questions) shall be allotted to each element.
Advantages of New-type Examinations.—So we arrive at
the new-type examination which contains a large number
of brief questions, instead of a few long ones. Every
question is so arranged that all markers will agree as to the
rightness or wrongness of an answer. This elimination of
the personal element is not the only advantage. For the
hundred or more questions of the new-type test are capable
of sampling the whole field of knowledge much more
comprehensively than can the half-dozen or so questions of the
orthodox examination. Candidates can no longer complain
that the particular questions failed to suit them,
since all their strong and weak points are probed. This
fact should help to reduce the ‘spotting’ and cramming of
likely questions. Again, the subjectivity of standards of
marking need no longer cause trouble, since the examinees'
marks are pure ‘count scores’ (cf. p. 24). So long as the
questions are designed to cover all levels of difficulty
evenly, the distribution of marks is sure to approximate
normality, and can readily be converted into any desired
percentage or other scale. The examiner may, of course,
draw an arbitrary pass line at any point in the distribution
that he wishes, so as to cut off the examinees who fail


chapter. Here it will be sufficient to distinguish
the so-called ‘open’ or ‘recall’ type of question and the
‘closed’ or ‘recognition’ type. In the former the
desired answer consists of a single word, or a few words
or numbers, which the examinee must write either in a
space provided on the examination paper itself, or on a
separate sheet. Questions 1-16 on pp. 240-1 are of this
type. Since, however, there is always some danger of
partially right or alternative answers being given, whose
marking would involve subjective judgment, the other
type of question is more commonly adopted, where the
examinee has to identify the right answer from among two
or more answers printed on the examination paper. Either
he may underline or put a check beside the answer he
believes to be correct, or else merely write down its letter
or number. Questions 17-54 on pp. 241-6 are of this type.

This latter system has the additional advantage that
practically all writing is eliminated, so that when
geography or history is being examined, irrelevant factors
such as speed of writing or literary style do not distort the
examiner's marking. It has been claimed that the average
examinee, instead of expressing less than one idea or fact
per minute (as in an essay-paper), can deal with three to
six new-type questions a minute. But this naturally
depends on the level of difficulty of the questions. A
further advantage is that the examinee who knows scarcely
anything about the subject, but who bluffs by
‘free-associating’ around an essay-type question, and who often
gains several marks by doing so, is entirely foiled. Older
students realize how much fairer is the new-type test, and
on the whole prefer it to the orthodox examination.
Younger pupils, also, according to several American
investigators, enjoy it more.

A new-type examination is given below, dealing with the
subject-matter of the previous chapters in this book, which the
reader is advised to try to answer. It is not meant to be
a complete examination on mental testing and statistics
(hence it violates several of the rules of construction given
in the next chapter). But it introduces several of the main
varieties of new-type questions, some easy, some fairly
difficult, and should give the reader who is not already
familiar with such questions a better insight into them.
Nos. 1-8 are simple-recall type, Nos. 9-16 completion
type, Nos. 17-25 true-false type, Nos. 26-36 multiple-choice
type, Nos. 37-47 matching type, and Nos. 48-54
rearrangement type. The correct answers are listed on
pp. 290-1.


EXAMINATION ON MENTAL MEASUREMENT AND STATISTICS.
Time: One Hour


Please read the following directions carefully before beginning.
Answer as many questions as possible, in any order. The
answers should be written in the right-hand margin, on the
blank spaces provided. Any rough figuring may be done on
blank paper.
The answer to every question (except Nos. 37-47) consists
of one letter, word, or number.
You may guess if you are not certain of an answer, but as
marks will be deducted for wrong answers, random guessing is
not advisable.


1. Eight pupils obtained the following school marks :

     A   B   C   D   E   F   G   H
    15  10  12  10   9  12  10  18

   What is their median ? ........

2. If the above marks were put into rank order, what
   would B's rank position be ? ........

3-5. The following is an extract from the norms for a group
   intelligence test :

   Score . . . .   48  58  68  77  84  89  98
   Mental Age  .    9  10  11  12  13  14  15 and adult

   What are the respective intelligence quotients of :
   3. A child aged 8-0 who scores 77 ? ........
   4. A child aged 10-0 who scores 63 ? ........
   5. An adult aged 20-0 who scores 77 ? ........

6-8. In a certain secondary school all the marks are transposed
   into the following standard scale. A pupil obtains marks
   of 60%, 47%, and 63% respectively in English, French,
   and Mathematics. In his English class there are . . pupils,
   in his French class 35, and in his Mathematics class . .

   Percentile    Mark
       90        78%
       75        67%
       50        60%
       25        58%
       10        47%

   What is his approximate place (i.e. his rank position in
   the class) in :

   6. English ? ........
   7. French ? ........
   8. Mathematics ? ........


other words or letters similarly.

The compensation theory of mental organization is disproved by
the fact that all correlations between mental abilities are (9).
The results of correlational investigations also disprove the
view that the mind is made up of distinct (10). The Two-factor
theory, proposed by (11), regards each ability as made up of a
factor common to all abilities, called (12), and a factor (13) to
that ability alone, called s. However, when verbal tests,
performance tests, and other tests of intelligence are compared,
it is doubtful whether (14) will account for all the correlations.
Residual correlations indicate the presence of (15), such as
verbal ability, v, and practical ability (16).

 9. ........
10. ........
11. ........
12. ........
13. ........
14. ........
15. ........
16. ........


17-25. Write a + sign in the margin after any of the following
statements which are true, and a − sign after those which
are false.

17. An omnibus test is a group intelligence test in which
    items of many kinds are mixed up ........
18. A majority of dull children are below average in health
    and physical development ........
19. Performance tests are applied by psychologists to
    measure a child's practical ability in everyday life ........
20. In order to measure the average intelligence of a class
    of 7-year-old children, a teacher would be best advised
    to use a group test with pictorial material ........
21. In a normal frequency distribution curve the
    semi-interquartile range and the standard deviation are
    identical ........
22. University students who obtain scholarships on the
    basis of competitive examinations as often as not obtain
    no better degree marks than those who fail to win such
    scholarships ........
23. The advantage of scoring abilities in terms of Mental
    Age or Educational Age is that the units of measurement
    are all equal to one another ........
24. A group of a hundred children picked out as being
    especially superior in one school subject will practically
    never show as great an amount of average superiority in
    any other subject ........
25. The Probable Error of a pupil's English composition
    mark is generally likely to be smaller than that of his
    arithmetic mark ........


26-36. Write the letter (A, B, or C, etc.) of the correct answer
in the margin.

26. Group intelligence tests, as contrasted with the
    Stanford-Binet test, are : ........
    (A) More varied in content
    (B) Better measures of children's intelligence
    (C) More objectively scored
    (D) More difficult to apply
    (E) Less suitable for application to superior adults

27-32. A set of 51 pupils took three examinations, A, B, and C.
    The following distributions of marks were obtained :

    Mark      A B C
    87 + .    а 1 8
    80 + .    21/8 2 5
    73 + .    4 4 7
    66 + .    18 7 11
    59 + .    LO 13
    52 + .    5 14 8
    45 + .    2 9 4
    38 + .    1 8 2

              51 51 51

27. Which of the distributions approximates fairly closely
    to normal, A, B, C, or None ? ........
28. Which of the distributions is negatively skewed,
    A, B, C, or None ? ........
29. Which of the distributions is likely to have the smallest
    standard deviation ? ........
30. A certain pupil happened to get 70% on all three
    examinations. In which of them did he do relatively
    best ? ........
31. If you were calculating the average of Distribution C,
    what figure would you take as your arbitrary mean ? ........
32. In calculating this same average, what figure would be
    multiplied by 2 (i.e. by the number of pupils who scored
    38 +) ? ........

33. A group intelligence test was applied in a secondary
    school, and the occupational preferences of all the pupils
    were ascertained and classified under ten headings. In
    order to find whether there is any correspondence between
    occupational preference and intelligence, which of the
    following statistical techniques could best be applied ? ........
    (A) Biserial r.
    (B) Product moment correlation.
    (C) Analysis of variance.
    (D) Contingency coefficient.
    (E) Correlation ratio.
    (F) Multiple correlation.

34. Which is the least serious of the following defects in
    intelligence quotients obtained from group intelligence
    tests ? ........
    (A) Scores are considerably affected by previous practice
        on similar tests.
    (B) The standard deviation of group test I.Q.s differs
        considerably in different tests.
    (C) Few of the published tests are adequately
        standardized, or have reliable norms.
    (D) Group tests are generally given with a time limit,
        and the child who is quickest is not necessarily the
        most intelligent.

35. Which one of the following statements is untrue ? The
    distribution of marks given by an examiner to a class of
    fifty pupils may be very irregular or skewed because : ........
    (A) The number of cases is too small for a smooth,
        normal distribution to be expected.
    (B) The examiner's opinion of the relative merits of the
        compositions is a personal one.
    (C) The pupils' abilities may be asymmetrically
        distributed, as when, for example, the class contains a
        long tail.
    (D) The scale of marks which the examiner adopts may
        be lacking in any rational basis.

36. In an experiment on the scores of graduate and
    non-graduate teachers on verbal and non-verbal group
    intelligence tests, these results were obtained :

                               Verbal Tests   Non-verbal Tests
    Graduates . . . . . .         158.9            78.6
    Non-graduates . . . .         148.5            77.0
    P.E. of the difference
      between means . . .           2.1             1.2

    Which of the following conclusions may legitimately be
    drawn ? ........
    (A) Graduates do better than non-graduates at
        intelligence tests.
    (B) All teachers do better at verbal than at non-verbal
        tests.
    (C) Scores on verbal tests are improved by taking a
        university degree.
    (D) Teachers with university training have no
        appreciable advantage over others on non-verbal tests.

[Figure: four pairs of frequency curves, labelled (A) to (D), each
plotting frequency (F) against I.Q.]

37-41. The above pairs of curves (A-D) represent the approximate
    frequency distributions of the intelligence quotients of
    several groups of children. Below are listed several
    such groups. Opposite each write the letter of the
    pair of curves which corresponds. Write ‘None’ opposite
    the groups to which no pair of curves corresponds.

37. One thousand elementary school pupils, and one
    thousand special school pupils ........
38. All elementary school pupils and all special school
    pupils in the country ........
39. All elementary school pupils and all children in the
    country ........
40. All special school and all secondary school pupils in the
    country ........
41. All elementary school and all secondary school pupils
    in the country ........

42-47. Below is given a list of psychologists (A-F), together
    with a list of important pieces of work in educational
    psychology. After each of the latter write the letter of
    the psychologist who was mainly responsible for, or whom
    you chiefly associate with, this piece of work. Two
    letters may be written if two psychologists were
    prominently associated. Some letters may be used twice or
    more, some not at all.

    (A) Ballard
    (B) Binet
    (C) Burt
    (D) Drever-Collins
    (E) Terman
    (F) Thurstone

42. Development of the conception of the Intelligence
    Quotient ........
43. Development of a scale of performance tests ........
44. Investigations by correlations into the factors
    underlying abilities ........
45. Advocacy of new-type examinations ........
46. Construction of educational attainments tests ........
47. Re-standardization of the Binet-Simon scale ........

48-54. An arithmetic test, a handwriting test, and two group
    intelligence tests, were given to an ordinary school class,
    and the following inter-correlations were worked out :
    (A) Combined intelligence tests with handwriting test.
    (B) Combined intelligence tests with arithmetic test.
    (C) Combined intelligence tests with chronological age.
    (D) One intelligence test with the other.
    (E) Arithmetic test with handwriting test.
    (F) Handwriting test with chronological age.
    (G) One intelligence test with arithmetic test.

    Which pair would you expect to yield :
48. The highest correlation ? ........
49. The second highest ? ........
50. The third ? ........
51. The fourth ? ........
52. The fifth ? ........
53. The sixth ? ........
54. The lowest correlation ? ........


DEFECTS OF NEW-TYPE EXAMINATIONS

So far only the advantages of new-type examinations
have been mentioned, and we must now consider criticisms
of them in detail. Two minor objections will be admitted.
First, the cost of printing or duplicating a new-type
test is considerably greater than that of an essay-type
examination. For classroom purposes, however, oral
dictation of the questions can often be adopted. If
several groups of pupils or students are to be examined,
they can write their answers on blank sheets, and their
test papers can be used several times over.
Secondly, much greater time and skill are needed for
constructing a good new-type test. The major criticisms
which are to be discussed below apply only too often to
inferior tests which have been rapidly constructed by
inexperienced amateurs. It is highly desirable that teachers
should be trained in the principles of testing during their
training college courses. The time needed in setting a
paper is, however, more than offset by the time saved in
marking, when the number of examinees is large. Indeed,
since the marking should be fool-proof, there is no reason
why it should not be done by clerks or secretaries. In
large public examinations, the authorities could save much
expense by paying professional examiners only to set, not
to correct, the papers. The present writer finds that the
construction of a one-hour's new-type test in educational
psychology (such as that appended above) takes him about
twenty hours. Setting an essay-examination scarcely
needs one hour. But in marking these one-hour
examinations he can easily do thirty of the former as against ten or
less of the latter scripts per hour. He, therefore, finds no
saving in total hours of work unless the number of
examinees is 300 or more. But the setting of a new-type
test is an occupation which can be done in odd moments
throughout the year; and the marking is simply a routine
matter which involves no mental strain. By contrast, the
marking of large numbers of essay-type scripts in psychology
is most fatiguing work.
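
The arithmetic behind this comparison may be sketched as follows, taking the writer's own figures (twenty hours against one hour for setting, thirty against ten scripts marked per hour) as the default values. The function names are hypothetical, and the sketch assumes that setting and marking are the only labour involved.

```python
from fractions import Fraction
import math


def total_hours(setting_hours: float, scripts_per_hour: float, examinees: int) -> float:
    """Hours spent setting one paper plus marking every script."""
    return setting_hours + examinees / scripts_per_hour


def break_even_examinees(set_new: int = 20, set_essay: int = 1,
                         rate_new: int = 30, rate_essay: int = 10) -> int:
    """Smallest number of examinees at which the new-type test costs no more
    hours in total than the essay paper (exact arithmetic via Fraction)."""
    extra_setting = Fraction(set_new - set_essay)
    saving_per_script = Fraction(1, rate_essay) - Fraction(1, rate_new)
    return math.ceil(extra_setting / saving_per_script)


n = break_even_examinees()
print(n, total_hours(20, 30, n), total_hours(1, 10, n))
```

On these figures the two methods cost the same total labour at a little under three hundred examinees. If fewer than ten essay scripts can be marked in the hour, the new-type test shows a saving with a smaller group still.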
Guessing.—Examinees may obtain a certain number of correct answers
(in a recognition-type test) by pure chance guessing. If
they filled in answers at random they would, on the
average, achieve half-marks on the questions where two
responses are provided (such as the true-false type), and
quarter-marks on the questions where four responses are
provided. This fact disturbs the teacher far more than it
does the psychologist. For the latter realizes that every
examinee has the same chance of scoring, let us say, 35%,
by random guessing; he therefore regards 35% as the
effective zero point and only gives credit to scores above
this level. An alternative procedure, which comes to
much the same thing, but is fairer in that it does not
penalize the examinee who omits questions rather than
guess them, is to subtract the total wrong responses in
true-false questions from the total right ones; to subtract half
the total wrong in three-response questions, one-third in
four-response questions, and so on. That is to say, an

examinee's score corrected for guessing = $R - \dfrac{W}{n - 1}$,

where R is the total right, W the total wrong, and n is the
number of alternative responses provided in each question.
In general practice the correction is omitted in four- or
more response questions, since it is so small.
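
The correction just described (the wrong responses divided by one less than the number of alternatives, and subtracted from the right responses) may be sketched in a few lines. The function name and arguments are hypothetical; omitted questions are simply left out of both counts.

```python
def corrected_score(right: int, wrong: int, n_alternatives: int) -> float:
    """Score corrected for guessing: R - W / (n - 1).

    right          -- number of questions answered correctly (R)
    wrong          -- number answered incorrectly (W); omissions are not counted
    n_alternatives -- number of responses offered in each question (n)
    """
    if n_alternatives < 2:
        raise ValueError("a recognition question needs at least two responses")
    return right - wrong / (n_alternatives - 1)


# True-false items: every wrong response cancels one right response.
print(corrected_score(6, 4, 2))
# Forty four-response items answered wholly at random: on the average
# 10 right and 30 wrong, which the correction reduces to nothing.
print(corrected_score(10, 30, 4))
```

The second call shows why the chance level serves as the effective zero point: the examinee who guesses throughout is brought back, on the average, to a score of nought, while the examinee who omits a question neither gains nor loses by it.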
But the problem of guessing is a complex one which
cannot be entirely settled by this mechanical correction.
It has aroused widespread controversy, and a long
discussion of it may be found in Ruch (1929). Although the
above correction will compensate for the effects of guessing
in the average examinee, yet it will not do so in all cases,
since some will be luckier than others, in accordance with
the principles of probability. Suppose that a large group
of persons guesses the responses to ten true-false items; the
average person will get 5 of them right, and one-half the
group will get either 4, 5, or 6 right. Only about one
person in a thousand will be so lucky as to get 10 right,¹
or so unlucky as to hit on none of the right answers. If
the above correction is applied, the scores corresponding
to 10, 6, 5, 4, and 0 right will be +10, +2, 0, −2, and
−10 respectively. In a longer test the range of chance
scores which may arise from guessing will be still greater,
though the average (when corrected) will still be zero.
Undoubtedly, then, we must expect true-false tests to be
somewhat unreliable, but the effect will be less marked in
tests composed of three- or four-response items.
Experimental evidence does indeed show true-false tests to be less
reliable than multiple-choice tests; and the latter are less
reliable than tests made up of recall items. But the
difference between them is not large, and it may easily be offset
by the fact that a true-false test can contain more items
than a multiple-choice test which occupies the same length
of time, since true-false items can be answered more
quickly.

¹ The situation is effectively the same as getting the persons each
to toss a penny ten times, and counting up how many of them
obtain 0, 1, 2, . . ., 9, 10, heads.
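
The chances involved in guessing ten true-false items can be worked out exactly from the binomial distribution, which is the penny-tossing situation put into figures. The following is a minimal sketch; the function name is hypothetical, and the guessing is assumed to be wholly random.

```python
from math import comb


def chance_distribution(items: int, p: float = 0.5) -> list[float]:
    """Probability of getting 0, 1, ..., items right by random guessing,
    where p is the chance of guessing any single item correctly."""
    return [comb(items, k) * p**k * (1 - p) ** (items - k)
            for k in range(items + 1)]


dist = chance_distribution(10)
print("all ten right:     ", dist[10])        # 1 chance in 1,024
print("4, 5, or 6 right:  ", sum(dist[4:7]))  # 672 chances in 1,024

# Corrected scores (right minus wrong) for the cases mentioned in the text.
for right in (10, 6, 5, 4, 0):
    print(right, "right ->", right - (10 - right))
```

On an exact count the proportion of guessers obtaining 4, 5, or 6 right comes to roughly two-thirds, which is somewhat more than the round figure of one-half; the chance of all ten right, or of none right, is 1 in 1,024 in each case.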

Now, though we must allow that a considerable error
may occur among the scores of a small proportion of
examinees on account of guessing, in actual practice
entirely random guessing (which was assumed in the
previous paragraph) will very seldom occur. The examinee
is more likely to possess an incomplete knowledge of
certain items, sufficient, however, to make his guesses right
ones. Teachers may regard it as unfair that such an
examinee will obtain the same marks on those items as
will the more able examinee who really knows the right
answers. But a brief study of recognition-type questions,
such as those given above, will show that the examinee's
shaky knowledge may not always be to his advantage.
For the alternative (wrong) answers are usually worded
sufficiently plausibly to deceive the weak student. Hence
his ‘semi-guesses’ will in practice be very frequently
wrong. For this reason experimental researches indicate
that the guessing correction fully compensates, or even
over-compensates, for any advantages which the weak
student might gain by guessing. Other investigations
prove that the amount of guessing may be greatly reduced,
and the reliability and validity of the test slightly
increased, by instructing examinees not to guess, and by
warning them that wrong guesses will be penalized.

To sum up: random guessing may considerably distort
the marks on a recognition-type test, especially when the
questions provide only two alternative answers. It does
not actually do so to any great extent, partly because
guessing scarcely ever is random, partly because it can be
reduced by appropriate instructions, and partly because
such tests can include large numbers of items which make
for increased reliability. If the recognition-type of item
is to be condemned, it must be primarily on account of
some of the other reasons discussed below.

Suggestive Effects of Wrong Responses.—Another common
criticism which we can refute is that the presentation of
false statements and wrong answers to the suggestible
minds of pupils is pedagogically unsound. It is alleged
that they may absorb such statements and later on believe
them to be correct. This theory is an a priori one, and
experiments which have been devised to try it out have
always failed to provide confirmation. Probably the fact
that examinees know that about half the true-false
statements, and more of the multiple-choice responses, are
wrong is sufficient to offset the effects of suggestion.
Many teachers find that there is considerable pedagogic
value in getting pupils and students to correct their own
papers shortly after the test, so letting them see which
responses are wrong (cf. Ballard, 1923).

The Elementaristic Tendency in New-type Tests.—We
come now to the most obvious and most frequent objection
to new-type tests. It is said that they measure only the
most trivial aspects of ability, such as details of
information and fragments of knowledge. Such tests may indeed
analyse complex mental operations into elements which can
be objectively marked, but they cannot put the elements
together again, and in the process of analysis they omit
many of the most important aspects of ability. They
cannot, as can the essay-type of examination, show the
examinee's general understanding of the subject, nor his
. . . of facts, his capacity for organizing and
. . . his knowledge, nor his initiative and originality.
Far from reducing the tendency of examinees to
cram mechanically, they may increase the amount of rote
memorization and discourage any desire to achieve a
general grasp of the field. Investigation has, indeed,
shown that students revise their work in a different manner
when they know they are to sit a new-type examination,
and that they concentrate on acquiring such bits of
knowledge as are most likely to appear in the test questions.
Such criticisms embody a number of half-truths, and
may be countered by a number of arguments. The retort
may first be made that even if the new-type test cannot
measure ‘understanding,’ originality, etc., the essay
examination cannot do so either. For when examiners
come to assess such qualities, they disagree widely with
one another. The orthodox examination achieves a
reasonable degree of reliability when it most closely approximates
to a new-type test, i.e. when it asks primarily for factual
data, or when the examiners draw up a detailed analytic
scheme of the points which they wish to mark. Secondly,
it may be pointed out that examinees engaged on an essay
examination do not usually shine in regard to organization
of knowledge and original thinking. Rather they dash
down all the ideas and facts which they can recall, many of
them irrelevant. ‘Thinking’ as contrasted with
‘reproduction of items of information’ is at a minimum. The
new-type test taps all the relevant knowledge, with much
less waste of time, and prevents the candidate from
producing irrelevant material.
Such recriminations are also, however, only partially justified.
Though often true of present-day examinations, they would be less
applicable if the recommendations made in Chap. XIV were put into
effect. Much sounder is the argument which points to weaknesses in
the critics' psychological analysis of the contrasted
examinations. They commonly draw a sharp distinction between
information and reproduction, on the one hand, and understanding
and thinking, on the other hand. They assume, that is, that these
mental processes are uncorrelated. In actual practice the two
cannot be separated so readily. The acquisition and reproduction
of information always involves a certain amount of thinking, and
no thinking is possible unless a person possesses information to
think about. Hence, experiments in which attempts have been made
to measure these faculties separately have usually shown them to
be highly inter-correlated. It follows, therefore, that even when
a new-type test is apparently measuring nothing but information,
it is at the same time providing a pretty good measure of other
more complex types of ability. It cannot ask examinees to
‘Discuss . . .,’ ‘Compare . . .,’ etc., nor make any overt
provision for independent thinking, yet it is certainly drawing to
a considerable extent upon these ‘higher’ mental processes which
the orthodox examiner desires to assess. Conversely, a large
proportion of the average essay-type answer consists of the sort
of material which critics affect to despise when it is [...]

[...] response, or a sentence in an essay answer, may depend upon
‘understanding’ in one examinee, but may result from ‘facile
reproduction’ in another examinee, according to the respective
methods by which these examinees have studied the topic in
question. Nevertheless, it seems legitimate to make a partial
distinction between the two types, and to expect future
experimental investigations [...]

[...] English at all if new-type examinations became the sole
method of testing their abilities [...]

One reason why the criticisms of triviality and fragmentation seem
at first sight so justifiable and so serious is because many
new-type tests are incompetently constructed. Questions which ask
for bits of information are much the easiest to set, and,
therefore, do only too frequently constitute the bulk of home-made
examinations. But there is nothing inherent in new-type examining
which makes better questions unattainable. And the writer would
claim that at least half the questions in his own test, above, do
demand not only the possession of information, but also the
application of principles or the interpretation of information;
that is to say, ‘understanding.’

To conclude, then, the defect which we have been discussing is one
which may occur in new-type tests, but need not do so to any great
extent. Its effect upon the validity of the tests has been greatly
exaggerated by critics owing to their faulty views of the nature
and organization of mental abilities.
Defects Due to the New-type Medium of Expression.—Our next
objection is of a different order. It does not try to point out
something which new-type tests fail to measure, but rather to show
that they measure a good deal over and above the abilities which
we actually want to assess; and that this extra factor, having
little educational significance, seriously distorts and reduces
the validity of all new-type test results. Some of the most ardent
advocates of new-type testing, e.g. Ballard (1928), do not seem to
have taken this [...]. An excellent account of it may be found in
Hawkes and Lindquist (1937).
If the reader will analyse carefully the processes which take
place in his own mind when answering the new-type questions above,
he is likely to find a good deal going on besides the recollection
and application of his knowledge of the subject. He has [...] to
discover what he has to do in each question, and then to
manipulate his knowledge in such a way as to fit it in with the
[...] of the question. In a multiple-choice question he may arrive
at the right response, not because he is certain that he knows it,
but because he has eliminated the incorrect ones. Sometimes a
single word in the question or in the provided response will
enable him to choose correctly, without his having to think out
the full implications of all that is printed. Some questions may,
unknown to the author, contain clues to the correct responses
which do not require any knowledge whatsoever of the subject. But
if the reader is a sufficiently sophisticated new-type examinee,
he will at once take advantage of such faults in construction.
Examples of these irrelevant clues are given in the next chapter.
Unfortunately, the questions which are most liable to contain such
errors are those that evoke ‘understanding’ rather than
‘reproduction.’ Straightforward questions which ask primarily for
factual information (e.g. Nos. 27-33 and 42-47) are generally free
from them. But more elaborate questions (e.g. Nos. 34-41, 48-54)
involve a great deal of ‘disentangling’ and complex inference. The
ability to do such questions may depend rather largely on the
examinee's previous experience and practice with such tests, or
upon some unknown group factor, and not solely on his knowledge of
the subject being tested. Such sources of unreliability and
invalidity may be much reduced if the examiner is sufficiently
skilled in anticipating all the diverse mental processes which
examinees may employ, but they are likely to be eliminated only if
he falls back on questions of the simple-recall type, and asks
merely for scraps of information. Actually no examiner can
foresee [...] beyond the comprehension of the poor students, who
consequently just guess, and often guess rightly. Or the correct
answer may contain some unsuspected ambiguity which is not noticed
by mediocre students, who therefore choose it, but which is
noticed by good students, who therefore reject it. Such faults in
construction are still more likely to occur in tests devised by
amateurs.

It was shown in the previous chapter that many of the 
defects of the orthodox examination derived from the fact 
that examinees have to formulate or express their know- 
ledge and ability in a distorting medium, namely essays, 
and that examiners disagree in their interpretation and 
evaluation of these complex mental products. We see now 
that much the same is true of new-type examinations. 
Examinees have to express their knowledge and ability in a 
different but perhaps still more artificial medium. This
‘roundaboutness,’ and the complexity of the mental processes
involved in arriving at their answers, undoubtedly reduce the
validity of their marks.
The Subjectivity of New-type Tests.—A final defect in new-type
tests, which has been entirely neglected by the majority of
writers, is that they are, in a sense, just as subjective as
essay-type examinations. Only the personal element enters, not in
the marking, but in the setting of the questions. Pullias (1937)
has recently demonstrated that when two or more examiners
construct their own new-type test papers, intending to cover
precisely the same field of knowledge and ability, and apply them
to the same pupils or students, the average correlation between
the two sets of marks is only of the order of + 0·50. This figure
may be unduly low because the examiners were teachers who may not
have been highly skilled in test construction. Yet some
thirty-[...] different teachers took part in the experiment, [...]
habitually employed new-type tests, and the results were very
similar in different school subjects, and among pupils of various
ages. Pullias further found that standardized educational tests,
which were alleged to measure ability in the same school subjects,
only yielded an average inter-correlation of + 0·68. In some
instances the coefficients were decidedly higher (+ 0·8 to
+ 0·9), in others very much lower (+ 0·2 to + 0·3).

How may we account for such results? The sources of unreliability
described on pp. 224-33 apply here, too. These were:

1. Actual changes in the examinees, or function fluctuation.

2. Inadequate sampling. As already mentioned (p. 238), this should
be much less prominent in new- than in essay-type examinations,
especially if the instructions given at the beginning of the next
chapter are followed. The trouble is not so much lack of
thoroughness of the questions as the subjectivity of their
selection.

3. Statistical defects. These can arise, not because different
examiners adopt different scales of marking, but because the
average level, and the range, of difficulty among their questions
may differ. Such inconsistencies can, however, readily be
eliminated, nor did they affect Pullias's correlations.

[...] opinion determines which characteristics of the [...] shall
be assessed as good or bad. The new-type examination is subjective
because the examiner has to perform a similar process of analysis
in choosing questions, and in providing the good and bad answers
between which the examinees are to discriminate. Again, the
essay-examin[...]

[...] in the same way that different essay-examiners will
translate differently the answers to an essay [...]

[...] that the main stages in examining—(a) constructing
questions, (b) answering questions, (c) evaluating answers—in
reality constitute a single whole. The new-type examiner has,
indeed, eliminated subjectivity from (c), but at the cost of
greater subjectivity in (a). And his own opinion has inserted
itself into (b), since his questions embody a good deal of what,
in essay-examinations, is normally carried out by the examinees.
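The inter-examiner figures quoted from Pullias earlier in this section are ordinary correlation coefficients between two sets of marks gained by the same examinees. For a reader who wishes to check the agreement between two papers of his own, the following is a minimal sketch of the product-moment calculation. The function name and the sample marks are illustrative assumptions and are not drawn from Pullias's data.

```python
from math import sqrt


def pearson_r(marks_a, marks_b):
    """Product-moment correlation between two sets of marks.

    marks_a[i] and marks_b[i] must belong to the same examinee.
    """
    if len(marks_a) != len(marks_b) or len(marks_a) < 2:
        raise ValueError("need two equally long lists of at least two marks")
    n = len(marks_a)
    mean_a = sum(marks_a) / n
    mean_b = sum(marks_b) / n
    # Sum of cross-products and sums of squares about the means.
    cross = sum((a - mean_a) * (b - mean_b) for a, b in zip(marks_a, marks_b))
    ss_a = sum((a - mean_a) ** 2 for a in marks_a)
    ss_b = sum((b - mean_b) ** 2 for b in marks_b)
    if ss_a == 0 or ss_b == 0:
        raise ValueError("one set of marks shows no variation")
    return cross / sqrt(ss_a * ss_b)


# Hypothetical marks of six pupils on two teachers' papers.
paper_one = [42, 55, 61, 38, 70, 49]
paper_two = [50, 47, 66, 41, 58, 60]
agreement = pearson_r(paper_one, paper_two)
```

A coefficient near + 1 would mean that the two papers rank the pupils almost identically; values of the order reported above indicate considerable disagreement.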
Now that we recognize this subjectivity as the chief flaw in
new-type tests, we should be able to control and reduce it more
readily than in essay-examining. Nevertheless, we must certainly
conclude that both types of examination are imperfect mental
measuring instruments, and open to much the same objections.
Conclusive evidence as to their relative validity is not easy to
obtain, since usually it is only possible to compare their marks
with the marks on subsequent, equally fallible, examinations. But
by pooling the marks from several papers of both types, from class
teachers' estimates, etc., a fairly good criterion may be
achieved. And according to this, neither the essay-type nor the
new-type is definitely superior (cf. Lee and Symonds, 1933). The
evidence of the superior validity of new-type tests which Ballard
(1928) puts forward is quite unconvincing. Certainly there is no
proof that the application of new-type tests in university
entrance or scholarship examinations would give appreciably better
or poorer predictions of ultimate university success than at
present. Since, however, the errors which reduce the validities of
the two are different, it follows that they measure somewhat
different aspects of ability. Hence by combining them the
respective errors partially cancel one another, and [...] validity
is definitely superior to that of either of the separate types.
And as certain features of a scholastic [...] by the one, other
features by [...] employed in the spheres to which [...]

CHAPTER XIII 
CONSTRUCTION OF NEW-TYPE EXAMINATIONS 


The first stage in the construction of a new-type examination
should be the preparation of a detailed statement of what the
examination is intended to measure. A list of the topics which it
is to cover should be written out, and a rough assessment made of
their relative importance, for example on a 1 to 5 scale. Suppose
that the sum of all these assessments comes to 75, and that the
examination is to last two hours. Probably about 150 questions
will be needed. In that case each topic assessed as 1 in
importance should eventually be covered by two questions, each
assessed as 5 by ten questions, and so on. Naturally this
distribution of questions need not be enforced mechanically; but
the principle of allotting most questions to the most significant
parts of the field is important. If the test is designed simply to
show the acquaintance of students with a particular textbook, the
questions dealing with each topic may be proportioned to the
number of pages on the topic in the textbook.
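The arithmetic of this allotment may be sketched as follows. The sketch assumes that importance ratings (or, equally, page counts) are held in a dictionary, and it distributes any odd questions left over to the topics with the largest fractional shares; that tie-breaking rule, the function name, and the topic labels are illustrative assumptions rather than part of the procedure described above.

```python
def allocate_questions(importance, total_questions):
    """Share out total_questions among topics in proportion to importance.

    importance maps each topic to its rating (e.g. 1 to 5) or page count.
    """
    total_weight = sum(importance.values())
    if total_weight <= 0:
        raise ValueError("importance ratings must sum to a positive number")
    exact = {t: w * total_questions / total_weight for t, w in importance.items()}
    # Start from the whole-number part of each topic's share.
    allocation = {t: int(share) for t, share in exact.items()}
    leftover = total_questions - sum(allocation.values())
    # Give any remaining questions to the largest fractional remainders.
    by_remainder = sorted(exact, key=lambda t: exact[t] - allocation[t], reverse=True)
    for topic in by_remainder[:leftover]:
        allocation[topic] += 1
    return allocation


# Hypothetical list of 21 topics whose ratings sum to 75, as in the text.
ratings = {}
for i, rating in enumerate([5] * 10 + [3] * 5 + [2] * 4 + [1] * 2, start=1):
    ratings[f"topic {i}"] = rating
plan = allocate_questions(ratings, 150)
```

With these figures each point of importance is worth two questions, so a topic rated 1 receives two questions and a topic rated 5 receives ten, as stated above.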

We have already stressed the point that questions should not deal
with trivial details merely because these are the easiest to set.
It is very unwise also for teachers to take sentences straight out
of textbooks and turn them into true-false or completion items. If
this is done, the examinees will assuredly begin to study their
textbooks with an eye to such sentences, and will try to learn
them off by rote rather than to understand them. Textbook
sentences may occasionally be adapted as wrong answers in
multiple-choice questions, since they may then trap the
parrot-learner.

When the teacher and the examiner are one and the same person,
suitable questions, or ideas for questions, are likely to occur to
him while he is conducting the work of the class. These should be
noted down and filed; they will save him much time and effort when
the examination has to be made up. No examiner should expect to be
able to devise a complete test at one sitting, unless it is
intended to last only about a quarter of an hour. He will soon
discover for himself his speed of work, and will set aside
sufficient hours, on a number of different days, to enable him to
carry out the construction efficiently. The time needed will, of
course, be much reduced if he can take over some of the questions
from previous papers which he has set in the same subject. This is
often possible since the papers are usually collected at the end
of the examination, and not allowed to circulate freely round the
school or university. Before he enters upon the final stage of
construction, he should have in hand roughly double the number of
questions he is likely to need. Many will have to be rejected for
one reason or another, hence he requires plenty of spare ones.
The choice of questions must be governed by an entirely different
conception of difficulty from that which prevails in traditional
examinations. In accordance with the principles given in Chap. II,
the test should be so arranged that the poorest examinees will
obtain marks very little above 0%, the best ones very little below
100%, and so that the average will be as near as possible to 50%.
(These figures may then later be converted into some more
conventional scale.) A few questions, therefore, must be so easy
that almost all (say 95%) can manage them, a few so difficult that
scarcely any (say 5%) can manage them. Moreover, all intermediate
grades of difficulty must be adequately represented. By contrast,
all the questions in a traditional examination are usually
arranged to be roughly equivalent in difficulty. It follows that,
in a new-type examination, several sections of the work may be
omitted altogether if they are likely to be known so well that
more than 95% of the students would answer questions on them
correctly. It is a pure waste of time to include items which
everyone can do. Conversely, sections of the work may have to be
omitted if even the best 5% are unlikely to be able to answer
questions dealing with them; such questions would also be useless.

Thus the examiner's next step must be to grade each of his
tentative questions for difficulty, i.e. to assess what
proportions of examinees are likely to get each one right. The
grading may be quite coarse, e.g. A = 5 to 20%, B = 21 to 40%,
C = 41 to 60%, D = 61 to 80%, E = 81 to 95%. These estimates are
likely to be highly inaccurate at first, and to diverge widely
from similar estimates made by other examiners. But if he takes
the trouble to check his predictions after each paper, and note
how many examinees did actually pass each item, his capacity will
soon improve. The final version of the paper should contain
approximately equal numbers of A, B, C, D, and E items.
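The checking of predicted against actual difficulty lends itself to a simple routine. The sketch below uses the five bands just given; the function names, the rounding of percentages to whole numbers, and the treatment of items passed by fewer than 5% or more than 95% as ungraded are assumptions made for illustration.

```python
# Difficulty bands from the text: percentage of examinees answering correctly.
GRADE_BANDS = [("A", 5, 20), ("B", 21, 40), ("C", 41, 60), ("D", 61, 80), ("E", 81, 95)]


def grade_for(percent_correct):
    """Return the band letter, or None if the item is too hard or too easy to use."""
    p = round(percent_correct)
    for letter, low, high in GRADE_BANDS:
        if low <= p <= high:
            return letter
    return None


def check_predictions(predicted, passes, n_examinees):
    """List the items whose observed band differs from the examiner's estimate.

    predicted maps item -> estimated band; passes maps item -> number who passed.
    Returns item -> (estimated band, observed band).
    """
    mismatches = {}
    for item, guess in predicted.items():
        observed = grade_for(100 * passes[item] / n_examinees)
        if observed != guess:
            mismatches[item] = (guess, observed)
    return mismatches


# Hypothetical record for a class of 40 examinees.
estimates = {"q1": "A", "q2": "C", "q3": "E"}
passed = {"q1": 6, "q2": 30, "q3": 39}
to_review = check_predictions(estimates, passed, 40)
```

Items returned by `check_predictions` are those on which the examiner's judgment of difficulty proved mistaken, and an observed band of `None` marks an item which nearly everyone, or nearly no one, could do.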

What particular form of new-type question to adopt depends
entirely upon the subject-matter. Some questions are more readily
expressed in one form, some in another. No certain evidence of the
superiority or inferiority of any type is as yet available.
Experiments on this matter have been carried out; but the skill of
the authors of the items has not been controlled. Suppose, for
example, that multiple-choice items are found to be more reliable
and valid than true-false ones, this may merely show that the
experimenter is himself better at compiling multiple-choice items.
Nor can we definitely claim that different types measure
distinctive abilities. The correlations between two
multiple-choice tests, or two true-false tests, do not appear to
be higher than those between a multiple-choice and a true-false
test. This indicates that they both measure the same thing.

There may, however, still be some advantage in probing the mind by
a variety of techniques. And several different types may be
employed in a long paper so as to reduce monotony. In a short
paper this is undesirable, since each type needs to be prefaced by
full instructions, and examinees have to spend too large a
proportion of time in reading the instructions, and in readjusting
their ‘mental set.’ The following rule may be taken as a rough,
though not a rigid, guide. If matching items are included, there
should be at least five of them, each including 5 to 10 responses.
If multiple-choice items are included, there should be at least
ten of them. If true-false, simple recall, or completion items are
included, there should be at least twenty of each. The numbers of
other types of questions should be comparable to these figures.

When the questions have all been chosen they should be rearranged
so that all those of any one type are grouped together. In each
group the questions should be arranged in estimated order of
difficulty, starting with the easiest. If this precaution is
omitted, the poor and medium examinees may waste a lot of time
attempting difficult items at the start, and fail to reach easier
ones which they could do, and the reliability of their results
will be reduced. We will now consider each of the main types.
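The rearrangement and the rough minimum numbers just described may be sketched as below. Each question is represented, purely for illustration, as a tuple of type, estimated percentage passing, and text; that representation and the function names are assumptions.

```python
# Rough minimum numbers of items per type, taken from the rule of thumb above.
MINIMUM_PER_TYPE = {
    "matching": 5,
    "multiple-choice": 10,
    "true-false": 20,
    "simple recall": 20,
    "completion": 20,
}


def arrange_paper(questions):
    """Group questions by type and put the easiest first within each group.

    questions is a list of (type, estimated_percent_correct, text) tuples.
    Groups keep the order in which their type first appears.
    """
    groups = {}
    for question in questions:
        groups.setdefault(question[0], []).append(question)
    paper = []
    for items in groups.values():
        # A higher estimated pass rate means an easier item, so it comes first.
        paper.extend(sorted(items, key=lambda q: q[1], reverse=True))
    return paper


def types_below_minimum(questions):
    """Name the question types that have fewer items than the rough minimum."""
    counts = {}
    for qtype, _, _ in questions:
        counts[qtype] = counts.get(qtype, 0) + 1
    return sorted(t for t, n in counts.items() if n < MINIMUM_PER_TYPE.get(t, 0))


draft = [
    ("true-false", 30, "statement a"),
    ("multiple-choice", 50, "question b"),
    ("true-false", 90, "statement c"),
]
ordered = arrange_paper(draft)
too_few = types_below_minimum(draft)
```

Since the minimum numbers are offered only as a rough guide, `types_below_minimum` is best treated as a warning to the examiner and not as a bar to using the paper.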


SIMPLE RECALL AND OPEN-COMPLETION TYPE

A question or short problem is followed by a blank space where the
answer is to be written. Alternatively certain words or short
phrases are omitted from a sentence or paragraph, leaving blank
spaces to be filled in. For example:¹

1a. Who was the inventor of the steam engine? ........

1b. The inventor of the steam engine was ........

2. If y = 2x² + 5x + 9, what is dy/dx? ........

3. Ohm's law states that ........

4. When a salt is dissolved in water the molecules break up into
........ A positive electric charge is carried by the ........, a
negative charge by the ........ A balanced action takes place
whereby XY ⇌ ........ The ........ theory provides an explanation
of the ........ by salt solutions.

5. A geography test may consist of a map on which numbers, but no
names, are printed. Below is a column of these numbers and the
examinee is instructed to write opposite each the town or river
which it represents.

¹ The author feels that an apology may be needed for the puerility
of some of the illustrative questions in this chapter. They are
intended to show the application of the technique to various
school and university subjects, at varying levels of difficulty.
Preferring to construct fresh ones rather than to quote from
published tests, he is limited by the rustiness of his own
education.

Each correct answer receives one mark; thus 6 marks may be scored
on Question 4.

The advantages of this type of item are, first, its natural- 
ness and its similarity to ordinary questioning ; secondly, 
its freedom from the chance guessing effects which occur 
in closed or recognition items ; and thirdly, the facility 
with which it may be constructed. This facility, however, 
is somewhat delusive, and unsuspected flaws are very 
likely to creep in. The chief disadvantage of the type
is that its scope is practically limited to rote knowledge,
or skill at simple arithmetical and scientific problems.

A number of hints on construction may be listed as
follows.

(a) Try to ask only for important items of knowledge. 

(b) Ensure as far as possible that there is only one right 
answer, and therefore confine the answers to single words 
or numbers, or to very short phrases. Some of the blanks
in Question 4 violate this rule. The last one might be
filled by ‘conduction,’ ‘carriage,’ or ‘carrying.’ If
there is any likelihood of alternatives, the examiner should
list them all beforehand, and decide which ones to allow
as correct. In this instance he would have to decide
whether to disallow ‘passage.’ Examiners may find it
easier to conform to both these rules if they think of the
answer first, and then build up the question or sentence
around it, in such a way that the answer is completely
delimited.

(c) Keep the questions and sentences as short and as unambiguous
as possible, and in particular avoid over-mutilation of a
completion sentence by inserting too many blanks. A completion
sentence should always be looked on as a form of question. See
therefore that sufficient words are left to make it an
intelligible question. For the same reason, do not put blanks near
the beginning of a sentence, which can only be answered by
consulting later clauses of the sentence, or later sentences in
the paragraph. Most of the blanks should be at the end of the
sentences.
(d) Study the whole content of a sentence or question to see that
all of it is essential for the production of a correct response.
In badly constructed items, it may be possible to deduce the
response simply by the exercise of intelligence, without any
knowledge of the subject-matter. In others the correct response
may be given away somewhere in a later sentence; or hints may be
derived from the grammatical construction. For example, blank
spaces should not be preceded by ‘a’ or ‘an.’ An irrelevant clue
is provided by the ‘an’ in the following example:

6. A piece of land entirely surrounded by water is called an
........

This difficulty could be overcome by wording the sentence as a
question, or by printing ‘a(n) ........’

(e) Never vary the length of the blank space according to the size
of the answer expected. Do not even indicate by two or more lines
that the answer contains two or more words. These practices would
also supply irrelevant clues. See that the spaces are big enough
for examinees with large writing.

(f) For purposes of easy marking, it is desirable to arrange for
answers to be written in the right-hand margin. It is much more
difficult for the eye to pick out answers scattered all over the
page. Even the following is rather inconvenient:

Write down the author of each of these plays.
7. The Admirable Crichton ........
8. Candida ........

A better solution is to number the blank spaces and to print a
separate column of numbers opposite which the answers are written,
thus:

Write down the authors of each of these plays.
7. The Admirable Crichton          7. ........
8. Candida                         8. ........

A paragraph-completion test, with scattered answers, can be scored
more efficiently by constructing a stencil of cardboard, with
holes through which only the answers are visible when the stencil
is laid over the test paper. Under each hole on the stencil is
written the correct answer.
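Hint (b) and the marking stencil together amount to a key which lists, for every numbered blank, each answer that the examiner has decided beforehand to allow. A minimal sketch of marking against such a key follows. The function name, the ignoring of capital letters and surrounding spaces, and the particular alternatives allowed are illustrative assumptions; an examiner would settle his own list.

```python
def score_recall(answers, key):
    """Award one mark for each blank whose answer is among those allowed.

    answers maps blank number -> what the examinee wrote.
    key maps blank number -> set of acceptable answers, in lower case.
    """
    marks = 0
    for blank, allowed in key.items():
        given = answers.get(blank, "").strip().lower()
        if given in allowed:
            marks += 1
    return marks


# Key for the two specimen questions on plays, with alternatives listed in advance.
key = {
    7: {"barrie", "j. m. barrie"},
    8: {"shaw", "bernard shaw", "g. b. shaw"},
}
script = {7: "Barrie", 8: "Galsworthy"}
total = score_recall(script, key)
```

Because every permitted alternative is written into the key before marking begins, two markers using it should arrive at the same total for the same script.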


TRUE-FALSE TYPE

This generally consists of a set of statements, approximately half
of which are true, the rest false. The examinee has to indicate
which is which. In a spelling test the ‘statements’ may be reduced
to single words, whose correctness or incorrectness is to be
checked. For example:

Write a + after each of the following words which is correctly
spelled, and a − after each one wrongly spelled.
9. Medecine ........
10. Embarrassment ........
11. Accelerate ........
12. Sieze ........
13. Advertisement ........

A modification which is useful for tests of spelling, grammar, and
literary knowledge is to instruct the examinee to cross out, or
underline, the one word in each of a set of sentences which is
ungrammatical, or wrongly spelled, or wrongly quoted. One example
of each is given.

14. I met him and she in the town.

15. The ruffian demolished all the aparatus in the laboratory.

16. Rule, Britannia! Britannia rules the waves.

The ordinary true-false type has been commonly used in tests of
all subjects, because it is capable of sampling very quickly a
wide range of knowledge, and because examiners find it easy to
construct. But the uncertainties due to guessing (cf. pp. 247-50)
and, still more, its liability to unsuspected errors, have
recently made it less popular. Some recommend that it should only
be used when the multiple-choice type is ruled out for lack of
plausible alternatives. Thus the following are justified since
each involves only two possible answers:

17. Distance east or west of Greenwich is
known as longitude.                                True False

18. When summer time starts the clocks
are put back one hour.                             True False

¹ These should perhaps be classed as multiple-choice items, since
the examinee has to pick out the incorrect word from about half a
dozen correct ones.

There is no need to confine true-false items to rote information.
Indeed, questions of general principle, of interpretation and the
like, can well be expressed in this form. Great care is needed,
however, to see that they do not violate some of the following
principles of construction.

(a) Statements should be simply worded, otherwise they are liable
to contain sections which are true and sections which are false.
In general it is best to use sentences which contain only a single
clause, embodying only a single idea; to word them positively; and
to refrain both from negatives and from conjunctions such as
‘because’ and ‘therefore.’ The following examples are ambiguous,
and therefore unsatisfactory.

19. Napoleon lost the battle of Waterloo
because his army had insufficient ammunition.      True False

20. Beethoven was a great composer who
wrote many symphonies, operas, and sonatas.        True False

The most crucial part of a statement, which renders it true or
false, should usually come near the end.

(b) It is not easy to construct difficult statements, nor to
estimate their degree of difficulty beforehand. They need to be
sufficiently plausible to deceive the examinee whose knowledge is
incomplete, and therefore not too obviously true or false. And the
true ones should be just as likely to appear false as the false
ones true. At the same time they must be clearly true or false to
the able examinee, and therefore must not contain ‘tricks.’ Good
students look out for essentials, and so may be tripped up if
errors are introduced into minor details. The following example is
bad for this reason.

21. Water boils at 212° and freezes at 32° C.      True False


(c) It has been found that certain forms of wording are much more
apt to be used in true than in false statements, or vice versa.
Examinees soon get to recognize these, and so may answer correctly
without any knowledge at all of the subject-matter. The great
majority of statements, in amateur new-type tests, which contain
the words ‘All,’ ‘Always,’ ‘Never,’ ‘Impossible,’ are false. The
majority of those which contain ‘Usually,’ ‘Often,’ and the like,
are true. Most very long statements also tend to be true more
often than false. There is no need to discard the above words
altogether, but they should be used with caution, and, if
possible, introduced as frequently into true as into false
statements.

(d) If the number of statements is less than, say, twenty, there
should not be an exactly equal division between true and false,
else examinees will count up to make sure that they have checked
them half and half.

(e) The order in which the statements appear must, of course, be
entirely random. If the examiner never prints more than two true
or two false statements consecutively, examinees may again
discover this and take advantage of it. One way of arranging is to
toss a coin. If it shows heads, take the easiest of the true items
as No. 1. If the next four tosses are tails, take the four easiest
false statements as Nos. 2 to 5; and so on.

(f) Various methods for recording the responses have been used:
underlining the words True or False, or Yes or No, drawing a
circle round the letters T or F, or writing the letters T or F.
One of the simplest and most satisfactory is to write a + or a −
in a blank space at the end of each statement. These spaces should
all be aligned in a straight column in the right- or left-hand
margin.

(g) The correction for guessing, described on p. 248, should be
applied, and examinees should be warned of this in the general
instructions at the beginning of the test.
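The coin-tossing procedure of hint (e) and the correction of hint (g) may be sketched as follows. A pseudo-random number generator stands in for the coin. The rule for what to do once all the true or all the false statements have been used up (take the remainder from the other pile) is an assumption, as are the function names. The correction shown is the familiar rights-minus-wrongs form, wrongs being divided by one less than the number of alternatives; it is assumed here, since the formula of p. 248 is not reproduced in this chapter.

```python
import random


def coin_toss_order(true_items, false_items, rng=None):
    """Interleave true and false statements by repeated tosses of a coin.

    Both lists must already run from easiest to hardest.
    Returns a list of (statement, is_true) pairs in printing order.
    """
    rng = rng or random.Random()
    true_left, false_left = list(true_items), list(false_items)
    ordered = []
    while true_left or false_left:
        heads = rng.random() < 0.5
        # Heads takes the next easiest true item, tails the next easiest
        # false one; if the chosen pile is empty, draw from the other.
        if (heads and true_left) or not false_left:
            ordered.append((true_left.pop(0), True))
        else:
            ordered.append((false_left.pop(0), False))
    return ordered


def corrected_score(right, wrong, choices=2):
    """Rights minus wrongs / (choices - 1); omitted items are not counted."""
    if choices < 2:
        raise ValueError("an item needs at least two alternatives")
    return right - wrong / (choices - 1)


sequence = coin_toss_order(["t1", "t2", "t3"], ["f1", "f2"], random.Random(1))
true_false_mark = corrected_score(right=30, wrong=10)
```

For a true-false paper the correction reduces to rights minus wrongs, so an examinee with 30 right and 10 wrong receives 20; passing a seeded generator merely makes the printed order reproducible.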


MULTIPLE-CHOICE TYPE, INCLUDING BEST REASON AND
MATCHING ITEMS


A large number of variants of this type may be dis- 
tinguished, examples of most of which are quoted below. 

(i) The number of alternative responses is usually five, 
but it may range from seven to three, or even two, accord- 
ing to the number of possibilities implicit in the question. 
Six- or seven-response questions occupy rather a lot of 
space. Three- or two-response questions should usually be 
avoided because they require the application of guessing 
corrections. In any one new-type test there is no necessity 
to provide the same numbers of responses for all the 
questions. 

(ii) When the responses consist of single words, they can 
be printed all on one line, consecutively. They should 
be enclosed in brackets, and preferably be printed in 
italies. The examinee is then instructed to underline the 
one word in each set of brackets which is correct. For 


example : 
22, They (was, were) going (to, too) London (100,10) visit (their, 
there) aunt, (who, what) lived (there, their). 


When, as in most of the remaining examples, the responses
consist of several words or phrases, underlining is troublesome
both for the examinee and for the marker. The
responses may sometimes be printed consecutively and
numbered, and the examinee writes the number of his
choice in a blank space. For example:

23. What is the function of the sacral autonomic nervous
system? (1) To increase circulation and respiration. (2) To
control posture and movements of the limbs. (3) To increase
the activity of the reproductive organs. (4) To contract the
sphincters of the excretory organs. (5) To inhibit the activity
of the alimentary canal.

This type of question is, however, inconvenient to read;
hence it is more usual, as in the examples below, to put
the responses in a column. The examinee indicates his
choice by a cross in one of the blank spaces in the margin.
Unfortunately this method tends to consume a lot of
space.

(iii) The items may be stated in question or in completion
form. Thus Question 23 might equally well
commence:

23. The function of the sacral autonomic nervous system
is: (1) ...


(iv) In the ‘best reason’ type, the examinee is instructed
to check the best of a set of reasons for a given statement.
Occasionally a more suitable item may consist in finding
the worst reason. For example:

have reared them ........
The average difference between the I.Q.s of
pairs of identical twins is higher among those
reared apart than among those reared together ........
Children tested before placement in foster
homes, and again after several years, show a
greater increase in intelligence in the better
than in the poorer homes ........
Children of professional parents are on the
average markedly superior in intelligence to
children of unskilled labourers ........


(v) Two or three responses, instead of one, may be correct.
Occasionally none of them may be correct. The
instructions should make it absolutely clear to the
examinees what is wanted. For example:

25. Put a cross opposite two of the following English painters
who are especially famous for their landscapes:

Constable ....x....     Reynolds ........
Hogarth   ........      Romney   ........
Lawrence  ........      Turner   ....x....

26. Put a cross opposite one answer only.
Sir Thomas Browne is best known as the author of:

Fiction . . . . . ........
Plays . . . . . . ........
Poetry . . . . .  ........
Works of travel . ........
None of these . . ........


(vi) The same set of responses may apply to several
questions. For example:

Identify the following orchestral instruments by drawing a
circle round S for strings, W for woodwind, B for brass, and P
for percussion:

27. Clarinet . . . .    S  W  B  P
28. Triangle . . . .    S  W  B  P
29. Double bass . . .   S  W  B  P
30. English horn . .    S  W  B  P
31. French horn . . .   S  W  B  P

(vii) Matching tests are an extension of the previous
category. A number of questions and a number of
responses are listed in different order, and have to be fitted
together. For example:
32. On the left is a list of Treaties or Peaces, on the right a list
of Historical Developments which were associated with these
Treaties. Opposite each of these developments write the letter of
the Treaty to which it corresponds. Note that some of these
letters need not be used at all, and that some may be used twice.

Treaty              Historical Development

(a) ...             Napoleonic empire broken up ........
(b) Frankfort       League of Nations established ........
(c) Paris           German annexation of Alsace-
(d) ...               Lorraine ........
(e) ...             France regains Alsace-Lorraine ........
(f) Versailles      Independence of Holland finally
(g) Vienna            recognized ........
(h) Westphalia      Britain becomes supreme colonial
                      power ........
                    Western powers set the stage for
                      Balkan wars ........

Many matching tests have equal numbers of items in both
columns, each item corresponding to one, and only one, in
the other column. Since, however, this makes possible the
discovery of responses by a process of elimination, the
above plan, with unequal numbers, is superior. The numbers
of items should usually be five to twelve in each
column; a greater number involves waste of time in
searching. When possible the items in at least one column
should be in alphabetical order, to facilitate the search.
It may be noted that this type is more compact than
ordinary multiple choice, and so saves considerable space.

The multiple-choice type of question is the most generally
applicable in educational testing. Not only factual
information, but also quite elaborate reasoning and
interpretation can be measured by it. In the best-reason type,
for example, the alternatives provided may be graded in
reasonableness; but only one answer should be accepted as
correct. Again in matching tests, the examinees can be
required to apply a series of principles to a series of
problems. Very great care, however, is needed in the
construction, and the following hints may be useful.

(a) The multiple-choice type should not be used when
simple-recall questions would be adequate, i.e. when there
is only one definitely right answer to the question. For
example, it would be a waste of time to express arithmetical
problems this way:


33. з. 28 
7х6 = 1, 13,2. 


Indeed, most mathematical problems are better in open
than in closed form, since the latter may enable the 
examinee to work backwards from each answer until he 
reaches the right question, which is a very different matter 
from working forwards from the question. 

(b) All the responses provided must be sufficiently
plausible to be selected by a fair proportion of the
examinees. Otherwise there is no object in including them.
The incorrect alternatives can often be made up out of
misconceptions and errors which are likely to be prevalent
among the examinees. Sometimes, also, they will be
selected by many of the poorer examinees if they contain
several words or phrases identical with words or phrases
in the question or introductory statement. Both correct
and incorrect responses should, however, be homogeneous
in their mode of expression, length, and other external
characteristics. If the correct answer is always longer, or
always shorter, than the incorrect ones, examinees may
discover this and employ it to their advantage.


(c) Each response should be studied for possible irrelevant
clues. For example, each one must be grammatically
consistent with the question. This is especially necessary
in matching tests. Some testers throw together such
heterogeneous items that the examinees may be able to
match them without any real knowledge of the subject.
The following is an extreme (hypothetical) instance of a
bad item which, although supposed to show knowledge of
geographical terms, could be answered correctly by inference
alone.

34. (a) A geographical line along
        which atmospheric
        pressures are equal              Watershed
    (b) Scottish lakes or sea inlets
    (c) Winds which blow at a certain    Monsoons
        season in the Indian
        Ocean                            Isobar
    (d) A high piece of land from
        which rivers flow in two
        or more directions               Contour lines
    (e) Lines on a map showing
        heights above sea-level          Lochs

‘Contour lines’ must be (e), because ‘lines’ are mentioned
in (e). ‘Monsoons’ and ‘lochs’ must be either (b) or (c),
since these are the only remaining plurals. It is unlikely
that ‘monsoons’ would be printed so nearly opposite its
corresponding item; moreover ‘lochs’ sounds like ‘lakes,’
hence ‘lochs’ must be (b). (a) and (d) mention ‘geographical
line’ and ‘rivers,’ hence ‘isobar’ and ‘watershed’
will probably fit them.

The examiner should remember that a matching test is 
an extended form of multiple-choice item, and should there- 
fore see that for each item in one column there are at least 
three items, preferably four or more items, in the other
column which constitute possible answers. 

(d) Not infrequently a better item may be obtained by 
turning round a partly formulated question into an answer, 
and making the correct answer into a question. For
example : 


35. Emphasis by means of an understatement is called:
Meiosis . . . . ....x....
Hyperbole . . . ........
Euphemism . . . ........
Metaphor . . .  ........

This might be re-formulated as follows:

36. Meiosis denotes:
Emphasis by means of an understatement . . . ....x....
An exaggerated rhetorical expression . . . . ........
Substitution of a pleasant for an offensive
expression . . . . . . . . . . . . . . . . ........
An imaginative simile . . . . . . . . . . .  ........


knowledge, would be a set of phrases from among which
examinees would select the one showing meiosis. In 
general the examiner should study each item in an attempt 
to anticipate all the mental processes by means of which
the correct answer may be reached. If any of these pro- 
cesses are not representative of the ability that he wishes 
to measure, then he should modify the item. 

(e) It should hardly be necessary to point out that the 
position of the correct answer in the series must be chosen 
entirely at random. First and last places should be employed
as often as any of the intermediate places.
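
Hint (e) can be made mechanical rather than left to the examiner's judgment. The following is a minimal sketch, not a prescribed procedure: the function name `shuffle_responses` is invented for the illustration, and the seeded generator is used only so that the demonstration is repeatable.

```python
import random


def shuffle_responses(correct, distractors, rng=None):
    """Return the responses in random order, with the 1-based place of the key."""
    rng = rng or random.Random()
    responses = [correct] + list(distractors)
    rng.shuffle(responses)  # every ordering is equally likely
    return responses, responses.index(correct) + 1


# Demonstration: over many items the correct answer should fall in each of
# the four places about equally often, first and last places included.
rng = random.Random(0)
places = [
    shuffle_responses("Meiosis", ["Hyperbole", "Euphemism", "Metaphor"], rng)[1]
    for _ in range(4000)
]
counts = {place: places.count(place) for place in range(1, 5)}
print(counts)
```

Choosing positions by hand tends to favour the middle places, which observant examinees may learn to exploit; a random shuffle removes that clue.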
REARRANGEMENT TYPE

A series of items which normally fall into a certain
specific order is listed in disarranged order, and the
examinee has to rearrange them. For example: ¹

37. Below is a list of waves or rays. Opposite No. 1 in the
right-hand margin, write the letter (a, or b, or c, etc.) of the wave
or ray with the longest wave-length. Opposite No. 2 write the
letter of the wave or ray with the second longest wave-length,
and so on down to No. 7.

(a) Rays from a sodium vapour lamp             1. ....f....
(b) Waves from a black kettle containing       2. ....d....
    hot water
(c) Gamma rays                                 3. ....b....
(d) Television transmission waves              4. ....a....
(e) Rays from the green of the spectrum        5. ....e....
(f) Wireless telephone waves                   6. ....g....
(g) X-rays                                     7. ....c....

The mental processes involved in answering this question 
will naturally depend on what has been taught. If the 
pupils have been provided with such a list, they may answer 
by rote recall; if not, then they may have to perform an
elaborate process of analysis and re-synthesis of their
knowledge of various branches of physics.

¹ The reader may wonder why the instructions are not simplified
by telling the examinees to write 1 after the item that should come
first, 2 after the next, and so on. The reason is that the scoring of
such a response would be exceedingly time-consuming, unless the
whole of it was correct.

By disarranging the order of phrases in sentences,
Ballard (1923) provides a test which, he claims, measures
knowledge of sentence structure and of the canons of
English style. The time sequence of historical events, the
sizes of geographical areas, the order of operations in some
chemical or physical experiment, grammatical sequences in
foreign languages, and the like, can readily be expressed in
this type. It has not, however, been very widely used,
probably because of the difficulty of devising a simple and
just method of scoring (cf. Sims, 1937). Ballard’s (1923)
plan seems fairly satisfactory. Award one mark if the first
item is correctly identified; thereafter award one mark for
each item which correctly follows the one preceding it.
For instance, if an examinee answers Question 37:

(f) (d) (a) (e) (g) (b) (c) instead of (f) (d) (b) (a) (e) (g) (c),

(f) is correctly placed first and scores one; (d) correctly
follows (f) and scores one; (a) should not follow (d), but
(e) and (g) are in correct sequence, and so score two; (c)
does not score, although it is correctly placed last, because
it should not follow (b). The total score here is thus 4
marks.
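
Ballard's plan, as described above, lends itself to a short routine. The sketch below is an illustration under that reading of the rule (one mark for a correct first item, one mark for every item that immediately follows its proper predecessor); the name `ballard_score` is invented here and is not Ballard's own.

```python
def ballard_score(answer, key):
    """Score a rearrangement answer against the key by the plan described above."""
    if not answer or not key:
        return 0
    # For each item in the key, record which item should come directly after it.
    successor = dict(zip(key, key[1:]))
    score = 1 if answer[0] == key[0] else 0
    # One further mark for each item that correctly follows the one before it.
    for previous, current in zip(answer, answer[1:]):
        if successor.get(previous) == current:
            score += 1
    return score


key = list("fdbaegc")        # correct order for Question 37
examinee = list("fdaegbc")   # the answer discussed in the text

print(ballard_score(examinee, key))  # 4, as in the worked example
print(ballard_score(key, key))       # 7, the maximum for seven items
```

Because only adjacent pairs are compared, a single misplaced item costs at most two or three marks instead of throwing out every position after it, which is the chief merit of the plan.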

Certain other minor types of question have been described
by Ruch (1929), Rinsland (1938), and others. But
these all appear to be variants of the simple-recall or
completion, true-false, or multiple-choice types.

Final Stages in the Construction of a New-type Test.—The 
examiner should re-scrutinize every item, try to eliminate
all ambiguities, and ensure that the whole of each item is
going to function, i.e. that every word or phrase will be
essential in arriving at a correct response, and that there
is no possibility of obtaining correct responses through
irrelevant clues (cf. Hawkes and Lindquist, 1937). This
can be done much more effectively if the questions are
submitted to other experienced examiners who may detect
flaws which the author has not noticed.

Next it is essential to make up comprehensive instructions
for each type of question, which will be appropriate
to the mental level of the dullest examinees. Unless the
examinees are fairly sophisticated to new-type testing,
sample questions already answered should be included.
In classroom work it is a good plan to provide three
samples, one already answered, to read this through with
the class and point out why the given answer is correct,
and then to ask the class to suggest answers to the other
samples, helping any individual members who still do not
understand what is wanted.

Attention should then be directed to ease of scoring. If
all the answers can be aligned in a column in the right-hand
margin, a key can readily be prepared consisting of a sheet
for each page of the test, down the left-hand edge of which
are written the correct answers in the correct positions.

It might be supposed that as some questions are harder 
than others, they should be allotted more marks. But 
actually many experiments show that weighting is hardly 
worth the trouble. The unweighted total scores correlate 
almost perfectly with weighted totals. A very simple 
form of weighting such as the following may be used: 
award 1 mark to each correct simple-recall or completion 
response, also to each correct response in a matching item ; 
give 1 mark to each true-false or two-response multiple- 
choice item and apply the R-W correction ; give 1 mark 
to each three-response multiple-choice answer, but neglect 
the correction for guessing; give 2 marks to each multiple-
choice item where four or more responses are provided.
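
The simple weighting just described may be set down as a small routine. This is a sketch only: the item labels (`"recall"`, `"true_false"`, and so on), the three outcomes, and the function names are assumptions made for the illustration, not part of the scheme itself.

```python
def item_marks(kind, outcome, n_responses=None):
    """Marks for one item; outcome is 'right', 'wrong', or 'omitted'."""
    if outcome == "omitted":
        return 0
    right = outcome == "right"
    if kind in ("recall", "completion", "matching"):
        return 1 if right else 0
    # True-false and two-response items: 1 mark, with the R-W correction,
    # so that a wrong answer loses a mark.
    if kind == "true_false" or (kind == "multiple_choice" and n_responses == 2):
        return 1 if right else -1
    if kind == "multiple_choice" and n_responses == 3:
        return 1 if right else 0  # correction for guessing neglected
    if kind == "multiple_choice" and n_responses is not None and n_responses >= 4:
        return 2 if right else 0
    raise ValueError(f"unrecognised item: {kind!r} with {n_responses!r} responses")


def paper_total(items):
    """Total for one paper; items is a list of (kind, outcome, n_responses)."""
    return sum(item_marks(kind, outcome, n) for kind, outcome, n in items)


paper = [
    ("recall", "right", None),
    ("matching", "wrong", None),
    ("true_false", "right", None),
    ("true_false", "wrong", None),
    ("multiple_choice", "right", 3),
    ("multiple_choice", "wrong", 3),
    ("multiple_choice", "right", 5),
    ("multiple_choice", "omitted", 5),
]
total = paper_total(paper)
print(total)  # 1 + 0 + 1 - 1 + 1 + 0 + 2 + 0 = 4
```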

Although scoring is so rapid and fool-proof, slips do
occur, particularly in adding up totals. The marker should
therefore either go through the papers twice, or else let the
students or pupils re-mark their own papers and ask them
to bring any errors to his notice.

Finally the examiner will certainly improve his future 
examining if he takes the trouble to study the difficulty 
and the diagnostic value of each item. A simple method
is to sort the papers into four piles: the quarter with the
highest totals, the quarters above and below the median,
and the quarter with the lowest totals. The number of
times each item is correctly answered by members of each
group, and by the four groups combined, may be tabulated.
The examiner may then compare the actual difficulties of
the items with his estimates of difficulty made when
setting the paper, and may note whether the proportions
of passes for each item ascend regularly from the lowest
to the highest group. If they fail to do so he should
attempt to discern the flaw in the item which may be
responsible for its poor validity.
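
The four-pile method described above can be tabulated as follows. This is a minimal sketch under stated assumptions: each paper is represented as a list of 1s (item passed) and 0s (item failed), the total is the simple count of passes, and the function name `item_analysis` and the made-up papers are for illustration only. An item is flagged when its proportion of passes falls anywhere between one pile and the next higher pile.

```python
def item_analysis(papers):
    """Tabulate, item by item, the proportion of passes in four piles.

    Piles run from the lowest quarter of total scores to the highest.
    """
    n = len(papers)
    if n < 4:
        raise ValueError("at least four papers are needed to form four piles")
    ranked = sorted(papers, key=sum)                 # lowest totals first
    bounds = [n * k // 4 for k in range(5)]          # pile boundaries
    piles = [ranked[bounds[k]:bounds[k + 1]] for k in range(4)]

    table = []
    for i in range(len(papers[0])):
        proportions = [sum(p[i] for p in pile) / len(pile) for pile in piles]
        table.append({
            "item": i + 1,
            "proportions": proportions,                       # low to high pile
            "difficulty": sum(p[i] for p in papers) / n,      # all piles combined
            "ascending": all(a <= b for a, b in zip(proportions, proportions[1:])),
        })
    return table


# Eight made-up papers of four items; item 4 is passed chiefly by weak examinees.
papers = [
    [0, 0, 0, 1], [0, 0, 0, 1],
    [1, 0, 0, 1], [1, 1, 0, 0],
    [1, 1, 0, 1], [1, 1, 1, 0],
    [1, 1, 1, 0], [1, 1, 1, 1],
]
table = item_analysis(papers)
for row in table:
    print(row)
```

In this example the fourth item is the only one flagged, since all of the lowest pile pass it but only half of each higher pile; such an item deserves scrutiny for ambiguity or an irrelevant clue.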


CHAPTER XIV 
IMPROVEMENTS IN EXAMINING 


Examinations have far too great a burden to bear. They 
are supposed to indicate, as shown in Chap. XI, a large 
number of only partially related traits and abilities. If 
we could define more clearly what it is that we want to 
measure, and use for each purpose the most appropriate 
instrument, there is no doubt that we could effect a great 
improvement in the efficiency of these instruments. Let
us consider first some of the substitutes for examinations
which have been recommended by educationists, and the
functions which they may serve.

Oral Examinations, Viva Voces, and Interviews.— Oral 
examinations are, of course, much older methods of educa- 
tional measurement than written examinations. Their 
main defects were pointed out when written examinations 
began to take their place, about one hundred years ago. 
It was recognized that the situation to which successive 
candidates are subjected cannot be the same for all, so that 
no sure basis exists for a comparison between different 
candidates. The examiner's tone of voice, the wording of 
his questions, his encouragement or criticism, etc., may 
vary widely from one candidate to another. If, as in 
school inspections, all the candidates are interviewed in
the presence of one another, the examiner has to pose a
series of different questions which are unlikely to be closely
comparable in difficulty. In any viva voce the amount of
time devoted to each candidate is usually so short that
reliable indications of his abilities can hardly be expected.
There is the further defect that the situation may cause
much greater emotional disturbance than does a written
examination. The interviewer may actually wish to take
qualities of temperament and character into account in his
assessments, but the emotional qualities which are most 
prominent during the interview are probably not very 
typical of the candidate’s usual behaviour, nor of any great 
significance for educational purposes. When care is taken 
to put the candidate at ease, skilful questioning may, 
indeed, reveal something of his interests, his attitude to 
work, and the like, which are educationally relevant. Oral 
examinations in foreign languages will bring out the
candidate’s goodness of pronunciation, but cannot be trusted to
show his facility in conversation, because of the distorting
effects of emotion. For the same reason, viva voces in
other subjects cannot throw much light on the candidate’s
capacity for formulating his knowledge orally. Theoretically
they should constitute a valuable supplement to
written examinations, since they provide a fresh medium
of expression for ability. But investigations have shown
that judgments based on interviews tend to be still less
reliable than those based on essay-examination answers.
Hartog and Rhodes’s (1935) study of the Civil Service
interview is particularly apposite. Sixteen candidates
were interviewed in the ordinary way by two different
boards, each composed of four or five experienced
examiners. The members of each board consulted beforehand
as to the questions they would ask (but the two
boards did not inter-communicate), and had before them
the candidates’ educational records. During the quarter-
to half-hour interviews the members attempted to assess a
candidate’s alertness, general intelligence, and the suitability
of his personality for the Civil Service. After discussion
they drew up a final mark. It was found that members
of any one board did agree fairly closely in their judgments,
but that the two boards had reached widely differing
conclusions. The two final marks differed by 12% for the
average candidate, in one case by as much as 31%, in
another by as little as 1%. The correlation between the
two sets of marks was only + 0.41 ± 0.14. Except for the
small number of the candidates, there seems to be no
technical flaw in this experiment which would allow us to
challenge its depressing results.
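
The text does not say what the figure after the ± sign denotes. On the assumption that it is the probable error of the coefficient, as was customary in reports of this period, the usual formula, P.E. = 0.6745 (1 - r²) / √N, can be checked against the quoted values; the function name below is invented for the illustration.

```python
import math


def probable_error_of_r(r: float, n: int) -> float:
    """Probable error of a correlation coefficient r based on n cases."""
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)


# Sixteen candidates, correlation of 0.41 between the two boards.
pe = probable_error_of_r(0.41, 16)
print(round(pe, 2))  # 0.14
```

With only sixteen candidates the probable error is roughly a third of the coefficient itself, which is the sense in which the small number of candidates qualifies the result.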

Another investigation, described by Hollingworth (1929),
proved the complete inability of experienced business
managers to agree, on the basis of independent interviews,
as to the candidate who would be most suitable for a
particular job. From such data one might conclude that
the interview and oral questioning, though essential in
education for purposes of teaching, are useless for purposes
of assessing ability or the results of teaching. In the
course of a school year a teacher can undoubtedly build up
a fairly comprehensive and reliable picture of each of his
pupils’ abilities, which may be largely derived from oral
work. But the conditions here are very different from
those in a brief viva voce. The volume of questioning is
far larger, the emotional atmosphere less strained, and the
oral answers are supplemented by written work. In contrast
to teachers’ judgments of ability, their judgments of
character are known to be decidedly unreliable, although
by no means worthless (cf. Symonds, 1931; Vernon,
1938b).
Perhaps the best instance of the value of the interview
is provided by the work of vocational psychologists. In
advising a child on his career the psychologist applies
several objective tests of aptitudes, and takes both medical
and educational records into account. But his information
as to the child’s personality and interests is derived chiefly
from an interview. This is conducted informally so as to
put the child at ease, and though it follows no stereotyped
plan of questioning, it has, in fact, been designed very carefully
so as to elicit the child’s main traits (cf. Oakley and
Macrae, 1937). A further function of this interview is that
it enables the psychologist to interpret the test results and
other objective records, and to realize their significance in
a synthetic view of the child as a whole. Guidance which
is based merely on the mechanical application of tests or
examinations is not successful. But there is ample proof
that the guidance given by trained psychologists who
adopt these test and interview methods does possess a high
degree of predictive validity. Similar methods are used in
vocational selection, i.e. in choosing the best from a number
of candidates for a given job. They are known to work well
in this field also, since the psychologist habitually checks
his tests and the data for which he asks in the interview,
and sees whether they really do enable him to select
employees who turn out satisfactorily.

It would seem, then, that the interview, as at present
employed in educational circles, is practically valueless,
but that in skilled hands it may become a most useful tool.
There is no reason other than expense and the scarcity of
trained psychologists why the methods of vocational
selection and guidance should not be applied in education,
both for choosing pupils most likely to benefit from higher
education and for advising them as to the lines of study for
which they are best fitted. We would not claim that perfect
predictions could be achieved thereby, but they would
probably be greatly superior to those made on the basis of
written examinations and ordinary viva voces.

Intelligence and Special Aptitude Tests —Theoretically, 
intelligence tests should be able to fulfil one of the functions
commonly assigned to examinations much better than
examinations do at present, that is the selection of pupils
or students with the greatest intellectual promise, as distinct
from achievement. Among the common definitions of
intelligence are: ‘learning capacity,’ ‘educability,’ and
‘innate ability apart from the effects of schooling.’ They
should also be able to measure objectively the qualities
which essay-type examinations are supposed to elicit, and
which new-type tests are supposed to be incapable of
eliciting, such as ‘originality, power of thinking, grasp of
general principles, etc.’ Experimental investigations indicate
that they do have some value for such purposes, but
that they have to be employed with caution. Merely
because they are given the name ‘intelligence tests,’ they
do not necessarily predict very closely the same thing that
secondary school or university authorities mean by intelligence.
They seem to be most useful among younger children.
The Stanford-Binet scale, applied at 6 or 7 years,
correlates quite highly with educational achievement for
the next few years. Coefficients of + 0.70 or more may be
expected. Group tests are applicable from about 9 years,
but, as we have seen, are somewhat less trustworthy.
Their validity for predicting secondary-school achievement
is very moderate. Thomson’s (1936) result, derived from
large batches of pupils, may be quoted, though some
investigators have reported better agreement. He obtained
an average correlation of + 0.445 between English
and arithmetic special place examination marks and orders
of merit supplied by head masters two years later, and a
correlation of + 0.410 between intelligence tests and this
same criterion. At later ages the figures sink still lower;
though American psychologists claim moderate correspondence
when tests are applied to incoming freshmen
and are compared with subsequent university work.¹ For
three years the writer has applied different group tests
to university graduate students entering on a one-year
course for the training of teachers. In each year the correlation
with final training college marks in all theoretical
subjects was close to + 0.30 ± 0.04. Verbal tests generally
correlated about equally with science and with arts subjects;
non-verbal tests agreed distinctly better with the
former than with the latter. Both types of test, however,
gave coefficients of less than + 0.10 with final marks for
skill in teaching, i.e. with the marks which are chiefly taken
to denote the student’s merit qua teacher (cf. Vernon, 1939).

The reasons for this decline in the value of intelligence 
tests with age have been indicated in previous chapters. 
They include the greater degree of homogeneity, and 
higher standards of selection, among older pupils and 
students, and the apparent differentiation or specialization 


¹ The median figure for some eighty such studies summarized by
Boynton (1933) is + 0.425. But the heterogeneity of students
admitted to American colleges is usually so great that parallel results
here would probably be distinctly lower.
of abilities in adolescence and adulthood. Because of this
latter development, attempts have been made to test
special aptitudes, or group factors (mathematical, linguistic,
etc.), rather than general intelligence, at the post-primary
and later levels. Prognostic tests are intermediate
between intelligence and attainment tests. They try to
show, not how much mathematics or English the pupils
have previously acquired, but how well they can apply
their intelligence plus acquirements to fresh mathematical
or linguistic problems. Unfortunately, abilities have
hardly differentiated sufficiently at 11 or 12 years (the age
when secondary and technical school pupils are usually
selected) for accurate predictions to be possible. Nevertheless,
Earle (1936) has obtained good results with his
English and algebra tests. A good deal of research along
these lines at the university level has been carried out in
America. Thus, at Columbia University, the Thorndike
Intelligence Examination, a three-hour test which includes
educational material, is said to correlate over
+ 0.60 with first- and second-year marks. This coefficient
is decidedly better than those usually obtained either by
ordinary intelligence tests or by entrance examinations.
Many colleges apply ‘placement examinations’ on entrance,
consisting of general intelligence, English, mathematics,
science, foreign language, and social studies tests,
in order to determine the students’ main aptitudes. There
is room for considerable development of such work in our
own schools and universities. So far very little has been
published.



Meanwhile, group intelligence tests undoubtedly help at
the change-over from primary to secondary education, and
would still be quite useful in public school scholarship
examinations at 13 or 14+. The Ministry of Education
has already recommended their adoption in all special place
examinations. Even if Thomson’s figures, quoted above,
show them to be no better than the usual English and
arithmetic examinations, yet they are measuring a somewhat
different aspect of ability from the examinations, and

